Abstract


As part of the Face Recognition Technology (FERET) program, the U.S. Army
Research Laboratory (ARL) conducted supervised government tests and
evaluations of automatic face recognition algorithms. The goal of the tests
was to provide an independent method of evaluating algorithms and assessing
the state of the art in automatic face recognition. This report describes
the design and presents the results of the August 1994 and March 1995 FERET
tests. Results for FERET tests administered by ARL between August 1994 and
August 1996 are reported.


1. Introduction


The primary mission of the Face Recognition Technology (FERET) program is
to develop automatic face recognition capabilities that can be employed to
assist security, intelligence, and law enforcement personnel in the
performance of their duties. In order to achieve its objectives, the FERET
program is conducting multiple tasks over a three-year period from
September 1993. The FERET program is sponsored by the Department of Defense
Counterdrug Technology Development Program through the Defense Advanced
Research Projects Agency (DARPA), with the U.S. Army Research Laboratory
(ARL) serving as technical agent.

The  program  has  focused  on  three  major  tasks.  The  first  major  FERET  task 
is  the  development  of  the  technology  base  required  for  a  face  recognition 
system. 

The  second  major  task,  which  began  at  the  start  of  the  FERET  program  and 
will continue throughout the program, is collecting a large database of
facial images. This database of facial images is a vital part of the overall
FERET  program  and  promises  to  be  key  to  future  work  in  face  recognition, 
because  it  provides  a  standard  database  for  algorithm  development,  test, 
and  evaluation.  The  database  is  divided  into  two  parts:  the  development 
portion,  which  is  given  to  researchers,  and  the  sequestered  portion,  which 
is  used  to  test  algorithms. 

The  third  major  task  is  government-monitored  testing  and  evaluation  of 
face  recognition  algorithms  using  standardized  tests  and  test  procedures. 
Two  rounds  of  government  tests  were  conducted,  one  at  the  end  of  Phase  I 
(the initial development phase, ending in August 1994) and a second
midway through Phase II (the continuing development phase), in March 1995.
(A  followup  test  was  administered  for  one  of  the  algorithms  in  August 
1996;  results  are  reported  in  app  A.) 

The purpose of the tests was to measure overall progress in face
recognition, determine the maturity of face recognition algorithms, and have an
independent means of comparing algorithms. The tests measure the ability
of the algorithms to handle large databases, changes in people's
appearance over time, variations in illumination, scale, and pose, and changes in
the  background.  The  algorithms  tested  are  fully  automatic,  and  the  images 
presented  to  the  algorithm  are  not  normalized.  If  an  algorithm  requires 
that  a  face  be  in  a  particular  position,  then  the  algorithm  must  locate  the 
face  in  the  image  and  transform  the  face  into  the  required  predetermined 
position. 

The August 1994 evaluation procedure consisted of a suite of three tests.
The first test is the large gallery test. A gallery is the collection of
images of individuals known to the algorithm, and a probe is an image of an
unknown person presented to the algorithm. In the August 1994 test, the
gallery consisted of 317 individuals, with one image per person, and in the
March 1995 test, the gallery consisted of 831 individuals, with one image
per person. The differences between a probe image and a gallery image of
a person include changes in time (the images were taken weeks or months
apart); changes in scale; changes in illumination; and changes in pose.

Images in the FERET database were taken under semi-controlled conditions.
This is in contrast to many of the algorithms in the literature, where
results are reported for small databases collected under highly controlled
conditions.

The  second  and  third  tests  are  the  false-alarm  and  rotation  tests.  The  goal 
of  the  false-alarm  test  is  to  see  if  an  algorithm  can  successfully  differentiate 
between probes that are in the gallery and those not in the gallery. The
rotation test measures the effects of rotation on recognition performance.

As part of the FERET program, a procedure was instituted to allow
researchers outside the FERET program to gain access to the FERET
database (see app B for details).* Also, researchers can request to take the
FERET  tests.  Results  of  future  tests  will  be  reported  in  supplements  to  this 
report  that  will  be  issued  as  needed. 

Future  FERET  tasks  will  include  the  development  of  real-time  systems  to 
demonstrate face recognition in real-world situations. These
demonstration systems will provide the needed large-scale performance statistics for
evaluation  of  algorithms  in  real-world  situations.  This  decision  to  proceed 
with  the  development  of  real-time  systems  was  based  in  part  on  the  results 
from  the  March  1995  test. 

This report reviews algorithms developed under the FERET program and
the  data  collection  activities,  and  reports  on  the  results  of  the  August  1994 
and  March  1995  government-supervised  tests. 


*At the time of the test, the FERET database was made available to researchers in the U.S. on a case-by-case basis.
Distribution was restricted to the U.S. because of legal issues concerning the rights of individuals to their facial images.
As of May 1996, over 50 researchers had been given access to the FERET database.


2. Overview


The  object  of  the  FERET  program  is  to  develop  face  recognition  systems 
that  can  assist  intelligence,  security,  and  law  enforcement  personnel  in 
identifying  individuals  electronically  from  a  database  of  facial  images. 
Face  recognition  technology  could  be  useful  in  a  number  of  security  and 
law  enforcement  tasks: 

•  automated  searching  of  mug  books  using  surveillance  photos,  mug  shots, 
artist  sketches,  or  witness  descriptions; 

•  controlling  access  to  restricted  facilities  or  equipment; 

•  credentialing  of  personnel  for  background  and  security  checks; 

• monitoring areas (airports, border crossings, secure manufacturing
facilities, doorways, hallways, etc) for particular individuals; and

• finding and logging multiple appearances of individuals over time in
surveillance videos (live or taped).

Other  possible  government  and  commercial  uses  of  this  technology  could 
be 

•  verifying  identity  at  ATM  machines; 

•  verifying  identity  for  the  automated  issue  of  driver's  licenses;  and 

•  searching  photo  ID  records  for  fraud  detection  (multiple  driver's  licenses, 
multiple  welfare  claims,  etc). 

The  FERET  program  has  concentrated  on  two  scenarios.  The  first  is  the 
electronic  mug  book,  a  collection  of  images  of  known  individuals — in 
other  words,  a  gallery.  The  image  of  an  individual  to  be  identified  (a  probe) 
is  presented  to  an  algorithm,  which  reports  the  closest  matches  from  a 
large  gallery.  The  performance  of  the  algorithm  is  measured  by  its  ability 
to  correctly  identify  the  person  in  the  probe  image.  For  example,  an  image 
from  a  surveillance  photo  would  be  a  probe,  and  the  system  would  display 
the photos of the 20 people from the gallery that most resembled the
unknown individual in the surveillance photo. The final decision concerning
the  person's  identity  would  be  made  by  a  trained  law  enforcement  agent. 
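
The mug-book lookup described above amounts to a ranking step, which can be sketched as follows. This is a minimal illustration only: it assumes the algorithm has already produced one similarity score per gallery entry for the probe, with a higher score meaning a closer resemblance. The function name closest_matches and the score dictionary are hypothetical and are not part of the FERET tests.

```python
from typing import Dict, List


def closest_matches(scores: Dict[str, float], k: int = 20) -> List[str]:
    """Return the IDs of the k gallery entries most similar to one probe.

    `scores` maps a gallery ID to a similarity score for that probe
    (assumption: a higher score means a closer match).
    """
    # Sort by descending score; break ties by ID so the order is stable.
    ranked = sorted(scores, key=lambda gallery_id: (-scores[gallery_id], gallery_id))
    return ranked[:k]


if __name__ == "__main__":
    probe_scores = {"g001": 0.12, "g002": 0.87, "g003": 0.55}
    print(closest_matches(probe_scores, k=2))
```

The short list returned here corresponds to the photos shown to the agent, who makes the final identification.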

The second scenario is the identification of a small group of specific
individuals from a large population of unknown persons. Applications for this
type of system include access control and the monitoring of airports for
suspected terrorists. In the access control scenario, when an individual
walks up to a doorway, his or her image is captured, analyzed, and
compared to the gallery of individuals approved for access. Alternatively, the
system  could  monitor  points  of  entry  into  a  building,  a  border  crossing,  or 
perhaps  an  airport  jetway,  and  search  for  smugglers,  terrorists,  or  other 
criminals  attempting  to  enter  surreptitiously.  In  both  situations,  a  large 
number  of  individuals  not  in  the  gallery  would  be  presented  to  the  system. 




The  important  system  performance  measures  here  are  the  probabilities  of 
false alarms and missed recognitions. A false alarm occurs when the
algorithm reports that the person in a probe image is in the gallery when that
person  is  not  in  fact  in  the  gallery.  A  missed  recognition  is  the  reverse:  the 
algorithm  reports  that  the  person  in  the  probe  is  not  in  the  gallery  when 
the  person  is  in  the  gallery,  or  identifies  the  person  as  the  wrong  person. 
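
The two error measures defined above can be tallied as in the following sketch. It is an illustration under stated assumptions, not the scoring procedure used in the tests: each probe is assumed to have a known true gallery ID (or None if the person is not in the gallery), and the algorithm is assumed to report either a gallery ID or None. The function name error_rates is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def error_rates(
    truth: Sequence[Optional[str]], reported: Sequence[Optional[str]]
) -> Tuple[float, float]:
    """Return (false_alarm_rate, missed_recognition_rate).

    truth[i]    -- true gallery ID of probe i, or None if not in the gallery.
    reported[i] -- gallery ID reported by the algorithm, or None for
                   "not in the gallery".
    """
    false_alarms = misses = outside = inside = 0
    for true_id, reported_id in zip(truth, reported):
        if true_id is None:
            outside += 1
            # False alarm: a match is claimed for a person not in the gallery.
            if reported_id is not None:
                false_alarms += 1
        else:
            inside += 1
            # Missed recognition: rejected, or identified as the wrong person.
            if reported_id != true_id:
                misses += 1
    false_alarm_rate = false_alarms / outside if outside else 0.0
    miss_rate = misses / inside if inside else 0.0
    return false_alarm_rate, miss_rate
```

Note that a wrong identification of a person who is in the gallery counts as a missed recognition, matching the definition in the text.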

The primary emphasis of the FERET program has been to establish an
understanding of the current state of the art in face recognition from frontal
images  and  to  advance  it.  Additionally,  the  program  has  established  a 
baseline  for  the  performance  of  recognition  algorithms  on  rotated  facial 
images.  Later  phases  of  the  program  will  extend  successful  approaches  to 
the  task  of  identifying  individuals  when  facial  features  are  presented  in 
any  aspect  from  full  front  to  full  profile. 

To  address  these  tasks,  a  multiphase  program  was  instituted  by  DARPA, 
with ARL as the technical agent. In Phase I (September 1993 through
September 1994), five contracts were awarded for algorithm development and
one contract for database collection. Phase II continued the database
collection contract and exercised options on three of the algorithm development
contracts. 

Before the start of the FERET program, there was no way to accurately
evaluate or compare the face recognition algorithms in the literature.
Various researchers collected their own databases under conditions relevant to
the aspects of the problems that they were examining. Most of the
databases were small and consisted of images of fewer than 50 individuals.
Notable exceptions were databases collected by three primary researchers:

(1) Alex Pentland of the Massachusetts Institute of Technology (MIT)
assembled a database of ~7500 images that had been collected in a highly
controlled environment with controlled illumination; all images had the
eyes in a registered location, and all images were full frontal face views.

(2) Joseph Wilder of Rutgers University assembled a database of ~250
individuals collected under similarly controlled conditions.

(3) Christoph von der Malsburg of the University of Southern California
(USC) and colleagues used a database of ~100 images that were of
controlled size and illumination but did include some head rotation.


3.  Database 


A  standard  database  of  face  imagery  is  essential  for  the  success  of  this 
project,  both  to  supply  standard  imagery  to  the  algorithm  developers  and 
to supply a sufficient number of images to allow testing of these
algorithms. Harry Wechsler at George Mason University (GMU) directed the
effort  to  collect  a  database  of  images  for  development  and  testing  (contract 
number  DAAL01-93-K-0099). 

The  images  of  the  faces  are  initially  acquired  with  a  35-mm  camera.  The 
film  used  is  color  Kodak  Ultra.  The  film  is  processed  by  Kodak  and  placed 
onto  a  CD-ROM  via  Kodak's  multiresolution  technique  for  digitizing  and 
storing  digital  imagery.  At  GMU,  the  color  images  are  retrieved  from  the 
CD-ROM  and  converted  into  8-bit  gray-scale  images.  After  being  assigned 
a unique file name, which includes the subject's identity number, the
images become part of the database. The identity number is keyed to the
person photographed, so that any future images collected on this person will
have  the  same  ID  number  associated  with  the  images.  The  images  are 
stored  in  TIFF  format  and  as  raw  8-bit  data.  The  images  are  256  pixels  wide 
by  384  pixels  high.  Attempts  were  made  to  keep  the  interocular  distance 
(the  distance  between  the  eyes)  of  each  subject  to  between  40  and  60  pixels. 
The  images  consist  primarily  of  an  individual's  head,  neck,  and  sometimes 
the  upper  part  of  the  shoulders. 

The  images  are  collected  in  a  semi-controlled  environment.  To  maintain  a 
degree  of  consistency  throughout  the  database,  the  same  physical  setup  is 
used  in  each  photography  session.  However,  because  the  equipment  must 
be  reassembled  for  each  session,  there  is  some  variation  over  collections 
from  site  to  site  (fig.  1). 

The  facial  images  were  collected  in  11  sessions  from  August  1993  through 
December  1994.  Sessions  were  primarily  conducted  at  GMU,  with  several 
collections  done  at  ARL  facilities.  The  duration  of  a  session  was  one  or  two 
days,  and  the  location  and  setup  did  not  change  during  a  session.  Taking 
the  images  at  different  locations  introduced  a  degree  of  variation  in  the 
images  from  one  session  to  another  session,  which  reflects  real-world 
applications. 

A photography session is usually performed by a photographer and two
assistants. One assistant briefs each volunteer and obtains a written release
form (see app C). (A release form is necessary because of the privacy laws
in the United States.) The other assistant directs the subject to turn his or
her head to the various poses required. The images were collected at
different locations, so there is some variation in illumination from one session
to another. A neutral colored roll of paper was used as a standard
background in the images. Subjects wearing glasses were asked to remove
them.


Figure 1. Examples of variations among collections.

Figure 2. Possible aspects collected of subject face.

The photographs were collected under relatively unconstrained conditions.
For the different poses, the subjects were asked to look at marks on
the  wall,  where  the  marks  corresponded  to  the  aspects  defined  below. 

Some  questions  were  raised  about  the  age,  racial,  and  sexual  distribution 
of  the  database.  However,  at  this  stage  of  the  program,  the  key  issue  was 
algorithm  performance  on  a  database  of  a  large  number  of  individuals. 

A set of images of an individual is defined as consisting of a minimum of
five and often more views (see fig. 2 and 3). Two frontal views are taken,
labeled fa and fb. One is the first image taken (fa) and the other, fb, usually
the last. The subject is asked to present a different facial expression for the
fb image. Images are also collected at the following head aspects: right and
left profile (labeled pr and pl), right and left quarter profile (qr, ql), and
right and left half profile (hr, hl). Additionally, five extra locations (ra, rb,
rc, rd, and re), irregularly spaced among the basic images, are collected if
time permits. Some subjects also are asked to put on their glasses and/or
pull their hair back to add some simple but significant variation in the
images.

Each individual in the database is given a unique ID number. The ID
number is part of the file name for every image of that person, including
images from different sets. In addition, the file name encodes head aspect,
date of collection, and any other significant point about the image
collected; table 1 gives a detailed description of the image name convention.




Figure  3.  Typical  set  of  images  collected  in  one  sitting. 


Table 1. Image file name description.

Example file name: 00346hr001c.931230
Segments: a = 00346, b = hr, c = 001, d = c, e = 931230

Segment  Category               Code                Explanation
a        ID No.                 nnnnn               Unique for each individual.
b        Pose                   fa                  Full face or frontal: first shot.
                                fb                  Full face or frontal: last shot.
                                qr, ql              Quarter profile, right and left.
                                hr, hl              Half profile, right and left.
                                pr, pl              Full profile, right and left.
                                ra, rb, rc, rd, re  Arbitrary (random) positions (see fig. 1).
c        Special flags
         (Left flag)            0                   Image not releasable for publication.
                                1                   Image may be used for publication if authorized.
         (Right flag)           0                   ASA-200 negative film used for collection.
                                1                   ASA-400 negative film used for collection.
         (Middle flag)          0                   Image not histogram adjusted.
                                1                   Image histogram adjusted.
d        Special circumstances  a                   Glasses worn.
                                b                   Duplicate with different hair length.
                                c                   Glasses worn and different hair length.
                                d                   Electronically scaled and histogram adjusted.
                                e                   Clothing color changed electronically.
                                f                   Image brightness reduced by 40%.
                                g                   Image brightness reduced by 60%.
                                h                   Image scale reduced 10%.
                                i                   Image scale reduced 20%.
                                j                   Image scale reduced 30%.
e        Date                   yymmdd              Date image taken.
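
The naming convention in table 1 can be decoded mechanically. The sketch below is illustrative and rests on several assumptions that the table does not state outright: that the example name reads 00346hr001c.931230 with a period before the date, that the three flag digits appear in left, middle, right order, and that the special-circumstances letter may be absent. The names FeretName and parse_feret_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pose codes listed in table 1.
POSES = {"fa", "fb", "qr", "ql", "hr", "hl", "pr", "pl",
         "ra", "rb", "rc", "rd", "re"}

# Assumed layout: 5-digit ID, 2-letter pose, 3 flag digits,
# optional special-circumstances letter, ".", 6-digit date.
_PATTERN = re.compile(
    r"^(\d{5})([a-z]{2})([01])([01])([01])([a-j]?)\.(\d{6})$"
)


class FeretName(NamedTuple):
    subject_id: str
    pose: str
    releasable: bool          # left flag
    histogram_adjusted: bool  # middle flag
    asa400_film: bool         # right flag
    special: Optional[str]    # letter a-j, or None if absent
    date_yymmdd: str


def parse_feret_name(name: str) -> FeretName:
    """Split a file name into the segments a-e of table 1."""
    match = _PATTERN.match(name)
    if match is None or match.group(2) not in POSES:
        raise ValueError(f"not a recognized FERET file name: {name!r}")
    sid, pose, left, middle, right, special, date = match.groups()
    return FeretName(sid, pose, left == "1", middle == "1",
                     right == "1", special or None, date)
```

Under these assumptions, the example name decodes to subject 00346, half profile right, ASA-400 film, glasses worn with different hair length, taken on 30 December 1993.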
A set of images is referred to as a duplicate set if the person in the set is in
a previously collected set. Some people have images in the database
spanning nearly a year between their first sitting and their most recent one. A
number of subjects have been photographed several times (fig. 1).

At  the  end  of  Phase  I  (August  1994),  673  sets  of  images  had  been  collected 
and  entered  into  the  imagery  database,  resulting  in  over  5000  images  in  the 
database. At the time of the Phase II test (March 1995), 1109 sets of images
were  in  the  database,  for  8525  total  images.  There  were  884  individuals  in 
the  database  and  225  duplicate  sets  of  images. 

The  primary  goal  of  the  image  collection  activities  in  the  fall  of  1994  was  to 
support  the  March  1995  test.  Approximately  300  sets  of  images  were  given 
out to algorithm developers as a developmental data set, and the
remaining images were sequestered by the government for testing purposes.

As an aid in the evaluation of the algorithms' robustness with respect to
specific variables, the sequestered database was augmented with a set of
digitally altered images. The database collectors changed the illumination
levels of 40 images by using the MATLAB Image Processing Tool Box
command "brighten()," using values of -0.4 and -0.6 to create images with the
illumination levels reduced by approximately 40 and 60 percent,
respectively. The function that changes the illumination is nonlinear. To test
sensitivity to scale changes, they electronically modified 40 images to show
10-, 20-, and 30-percent reductions of scale along each axis, using the
MATLAB Image Processing Tool Box command "imresize()." This
command uses a low-pass filter on the original image to avoid aliasing, and
bilinear interpolation to find each pixel density in the reduced image. This
approximates obtaining the images at a greater distance from the camera.
Finally, using Adobe Photoshop's paint brush tool, the database collectors
electronically modified portions of clothing in several of the images to
reverse the contrast. We had this done to see if any algorithms were using
cues from clothing for recognition.
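
The nonlinearity of the illumination change can be illustrated with a power-law mapping. The sketch below is an assumption-laden stand-in, not the collectors' procedure: it takes gray levels scaled to the range 0 to 1 and applies out = in ** (1 / (1 + beta)) for negative beta, which is modeled on the documented behavior of MATLAB's brighten command but should be treated as an assumption here. The name darken is hypothetical.

```python
from typing import List

Image = List[List[float]]  # gray levels scaled to the range 0.0-1.0


def darken(image: Image, beta: float) -> Image:
    """Apply a brighten-style power law with -1 < beta <= 0.

    Assumed mapping: out = in ** (1 / (1 + beta)), so beta = 0 leaves the
    image unchanged and more negative beta darkens it further.
    """
    if not -1.0 < beta <= 0.0:
        raise ValueError("beta must lie in the interval (-1, 0]")
    gamma = 1.0 / (1.0 + beta)
    return [[pixel ** gamma for pixel in row] for row in image]


if __name__ == "__main__":
    row = [[0.0, 0.25, 0.5, 0.75, 1.0]]
    print(darken(row, -0.4))
    print(darken(row, -0.6))
```

With a mapping of this kind, black and white stay fixed while intermediate gray levels are lowered by differing amounts, which is consistent with the reductions being described as approximate.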


4.  Phase  I 


4.1  Algorithm  Development 

The FERET program was initiated with an open request for proposals
(RFP); 24 proposals were received and evaluated jointly by DoD and law
enforcement personnel. The winning proposals were chosen based on their
advanced ideas and differing approaches. In Phase I, five algorithm
development contracts were awarded. The organizations and principal
investigators for Phase I were

•  MIT,  Alex  Pentland  (contract  DAAL01-93-K-0115); 

•  Rutgers  University,  Joseph  Wilder  (contract  DAAL01-93-K-0119); 

• The Analytic Science Company (TASC), Gale Gordon (contract
DAAL01-93-K-0118);

• University of Illinois at Chicago (UIC) and University of Illinois at
Urbana-Champaign, Lewis Sadler and Thomas Huang (contract
DAAL01-93-K-0114); and

• USC, Christoph von der Malsburg (contract DAAL01-93-K-0109).

Only  information  and  results  for  contracts  that  were  extended  into  Phase  II 
are  given  in  this  report;  for  brief  descriptions  of  the  individual  approaches, 
see  appendix  C. 

4.2  Test  Procedure 

Three  distinct  tests  were  conducted,  each  with  its  own  probe  and  gallery 
set.  The  large  gallery  test  evaluates  the  algorithm  performance  on  a  large 
gallery of images, the false-alarm test evaluates the false-alarm
performance of the algorithm, and the rotation test was designed to baseline
algorithm performance on nonfrontal (rotated) images.

TASC and USC were tested on 1 to 3 August 1994, and MIT, UIC, and
Rutgers on 8 to 10 August 1994. Government representatives visited each
testee's site to administer the test. The government representative
brought two 8-mm computer data tapes for each test to the
contractor's site. The first tape of each test contained the gallery, and the
second tape contained the probe images.

All images were processed while the government representative was
present. Results from the test were recorded, and the government
representative took the results back to the government facilities for scoring.
At the conclusion of the test, both the gallery and probe data were
removed from the testee's computer system and the tapes returned to the
government.

To ensure that matching was not done by file name, the government gave
the gallery and probe sets random file ID numbers, and kept the links
between the file name and ID number from the contractors by supplying
only the ID number as the labels for the gallery and probe sets for the test.
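
The relabeling step can be sketched as follows. This is only an illustration of the idea, not the procedure the government used: the six-digit label format, the function name assign_random_ids, and the use of a seeded generator are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional, Tuple


def assign_random_ids(
    file_names: List[str], seed: Optional[int] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Return (labels, key): labels go to the testee, key stays private.

    `key` maps each random label back to the original file name.
    """
    rng = random.Random(seed)
    # Draw distinct numbers so that no two images share a label.
    numbers = rng.sample(range(10 ** 6), len(file_names))
    key = {f"{number:06d}": name for number, name in zip(numbers, file_names)}
    # Sorting the labels removes any trace of the original ordering.
    return sorted(key), key
```

Only the list of labels would be handed to the testee; the key is what allows the results to be scored afterward.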

A  "pose  flag"  was  also  supplied  for  each  image,  as  this  information  would 
be  expected  from  the  hypothetical  "face  detection"  front-end  that  supplies 
the  localized  faces  to  the  classification  algorithm.  The  pose  flag  tells  the 
pose of the face in the image at the time of collection. The flags are fa, ql, qr,
hl, hr, pl, and pr, the same pose flags as in the FERET database.

The  computation  time  of  the  algorithms  was  not  measured  or  considered 
as a basis for evaluation. However, the algorithms had to be able to
perform the tests on a few standard workstation-type computers over three
days.  The  rationale  for  this  restriction  was  to  ensure  that  an  algorithm  was 
not  so  computationally  intensive  as  to  preclude  it  being  implemented  in  a 
real-time  system. 

4.3 Test Design

The  August  1994  FERET  evaluation  procedure  consisted  of  a  suite  of  three 
tests designed to evaluate face recognition algorithms under different
conditions. The results from the suite of tests present a robust view of an
algorithm  and  allow  us  to  avoid  judging  algorithm  performance  by  one 
statistic. 

The  first  test,  the  large  gallery  test,  measures  performance  against  large 
databases.  The  main  purpose  of  this  test  was  to  baseline  how  algorithms 
performed against a database when the algorithm had not been developed
and tuned with a majority of the images in the gallery and probe sets.

The second test, the false-alarm test, measures performance when the
gallery is significantly smaller than the probe set. This test models monitoring
an  airport  or  port  of  entry  for  suspected  terrorists  where  the  occurrence  of 
the  suspects  is  rare. 

The third test, the rotation test, baselines performance of the algorithm
when the images of an individual in the gallery and probe set have
different poses. Although difficult, this is a requirement for numerous
applications. This test was used only to establish a baseline for future
comparisons, because the rotation problem was out of the scope of the FERET
program.

The  algorithms  tested  are  fully  automatic.  The  processing  of  the  gallery 
and  the  probe  images  is  done  without  human  intervention.  The  input  to 
the  algorithms  for  both  the  gallery  and  the  probe  is  a  list  of  image  names 
along  with  the  nominal  pose  of  the  face  in  the  image.  The  images  in  the 
gallery  and  probe  sets  are  from  both  the  developmental  and  sequestered 
portions  of  the  FERET  database.  Only  images  from  the  FERET  database  are 
included  in  the  test.  Algorithm  developers  were  not  prohibited  from  using 
images  outside  the  FERET  database  to  develop  their  algorithms  or  tune 
parameters in their algorithms. The faces in the images were not placed in
a predetermined position or normalized. If required, prepositioning or
normalization must be performed by the face recognition system.

The  large  gallery  test  examines  recognition  rates  from  as  large  a  database 
as  was  available  at  the  time.  The  probe  set  consists  of  all  the  individuals  in 
the gallery, as well as individuals not in the gallery. For this test, the
gallery consisted of 317 frontal images (one per person), and the probe set
consisted  of  770  faces;  table  2  gives  a  breakdown  of  the  gallery  and  probe 
images  by  category. 

Each  set  of  facial  images  includes  two  frontal  images  (fa  and  fb  images),  as 
shown  in  figure  3.  One  of  these  images  is  placed  in  the  gallery  and  referred 
to  as  the  FA  image.  The  frontal  image  that  is  not  placed  in  the  gallery  is 
placed  in  the  probe  set  and  called  the  FB  image.  The  image  (fa  or  fb)  to  be 
designated  the  FA  image  can  be  selected  manually  or  randomly.  In  the 
August  1994  test,  all  the  fa  images  were  selected  to  be  the  FA  images.  In  the 
March 1995 test, the process was random, with a 50/50 chance of the fa or
fb image being selected as the FA image.
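
The two ways of designating the FA image can be sketched in a few lines. The code is an illustration under assumptions: each set is represented simply as a pair of fa and fb image names, and the function name assign_frontals and its arguments are hypothetical.

```python
import random
from typing import List, Tuple


def assign_frontals(
    sets: List[Tuple[str, str]], randomize: bool, rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Split each (fa, fb) pair into an FA gallery image and an FB probe."""
    gallery: List[str] = []
    probes: List[str] = []
    for fa_image, fb_image in sets:
        # randomize=False mirrors August 1994: fa always becomes FA.
        # randomize=True mirrors March 1995: a fair coin picks the FA image.
        if randomize and rng.random() < 0.5:
            fa_image, fb_image = fb_image, fa_image
        gallery.append(fa_image)
        probes.append(fb_image)
    return gallery, probes
```

In either mode every set contributes exactly one frontal image to the gallery and the other to the probe set.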

For diagnostic purposes, 48 FA images were placed in the probe set. For
these images, the algorithms should produce exact matches with their
copies in the gallery. Some probe images were not in the gallery, by which we
mean that the person whose image was in the probe was not in one of the
gallery images. Duplicate images are images of people in the gallery taken
from a duplicate set of images of that person (see sect. 3 for a definition
and description of duplicate sets of images). All the duplicates are frontal
images. Quarter and half rotations are those images with head rotation as
indicated (hl, hr, ql, and qr, as shown in fig. 2 and 3). The remaining
categories consist of the electronically altered frontal images discussed in
section 3.


Table 2. Type and number of images used in gallery and probe set for large
gallery test.

Image category                    Number
Gallery images:
  FA frontal images                  317
Probe images:
  FA frontal images                   48
  FB frontal images                  316
  Frontal probes not in gallery       50
  Duplicates                          60
  Quarter rotations                   26
  Half rotations                      48
  40% change in illumination          40
  60% change in illumination          40
  10% reduction in scale              40
  20% reduction in scale              40
  30% reduction in scale              40
  Contrast-reversed clothes           22
  Total probes                       770

The false-alarm test evaluates the false-alarm performance of the
algorithms. The system is presented with a small gallery and a large probe set,
with  many  individuals  unmatched  in  the  gallery.  All  images  for  this  test 
were  full  frontal  face  images.  For  this  test,  a  gallery  of  25  frontal  faces  (one 
image  per  person)  was  supplied.  The  probe  set  consisted  of  305  images; 
table  3  gives  the  type  and  number  of  the  various  images. 

We conducted the rotation test to examine algorithm robustness under
head rotations. A gallery of 40 quarter-rotated (qr or ql images) and 40
half-rotated (hl or hr) images (one per person) was supplied and tested with the
probe set defined in table 4.

Because the approach that TASC uses requires matched face/profile pairs
(see app C), TASC could not use the same test gallery and probe sets.
Therefore, a special test set was generated for evaluating the performance
of the TASC approach. For the large gallery test, the gallery consisted of
266 image pairs, with the probe set defined in table 5. For the August 1994
test, the reporting of confidence values was optional, and TASC elected
not to report the confidence scores. Thus, it was not possible to construct a
receiver operator curve (ROC) for TASC, and results are not reported for
the false-alarm test. (The decision to construct an ROC was made after
TASC took the test.) Because the TASC algorithm required frontal/profile
pairs, it could not be tested for rotation. Hence, the rotation test was not
taken.


Table 3. Type and number of images used in gallery and probe set for
false-alarm test.

Image category                          Number
Gallery images:
  FA frontal images                         25
Probe images:
  FB frontal images                         25
  Frontal probe images not in gallery      204
  40% change in illumination                10
  60% change in illumination                 9
  10% reduction in scale                    19
  20% reduction in scale                    19
  Contrast-reversed clothes                 19
  Total probes                             305

Table 4. Type and number of images used in gallery and probe set for
rotation test.

Image category                                    Number
Gallery images:
  Quarter rotations                                   40
  Half rotations                                      40
  Total gallery                                       80
Probe images:
  Quarter rotations (qr, ql)                          85
  Probes not in gallery (fa, fb, qr, ql, hl, hr)      50
  Intermediate rotations (fa, fb, hl, hr)             90
  Total probes                                       225



Table 5. Type and number of images used in gallery and probe set in large
gallery test for TASC.

Image category                              Number
Gallery images:
  FA frontal profile image pairs               266
Probe images:
  Frontal profile image pairs                  249
  FB frontal profile pairs not in gallery       25
  40% change in illumination                    10
  60% change in illumination                     8
  10% reduction in scale                        14
  20% reduction in scale                        14
  30% reduction in scale                        28
  Total probes                                 378

4.4  Output  Format 

The contractors were requested to supply the test results in an ASCII file
in the following format: the probe ID number being tested, a rank counter,
the gallery ID number of a match, and a false-alarm flag that indicates
whether the algorithm determined that the probe was in the gallery or not
(1 if the algorithm reported that the probe was in the gallery and 0 if the
probe was reported as not in the gallery). Also requested was the
confidence score of the match; see table 6 for an example of an output
file. The score of the match is a number that measures the similarity
between a probe and an image in the gallery. Each algorithm used a
different measure of similarity, and it is not possible to directly compare
similarity measures between different algorithms. Reporting the similarity
measure was optional on the August 1994 test. All algorithm developers
except for TASC reported this number. For the August 1994 large gallery
test, all algorithm developers reported the top 50 gallery matches in
ranked order for each probe. For the false-alarm test, the top 25 (the size
of the gallery) were reported, and in the rotation test, the top 25 were
reported.
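
As an illustration of the file format described above, the following
Python sketch reads such a results file. It is not the scoring software
used in the tests. It assumes whitespace-separated fields in the order
given in the text (probe ID, rank, gallery ID, false-alarm flag, optional
score); the names MatchRecord and parse_results and the sample values are
hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchRecord:
    probe_id: int
    rank: int
    gallery_id: int
    in_gallery: bool        # false-alarm flag: True for 1, False for 0
    score: Optional[float]  # None when no confidence score was reported


def parse_results(text: str) -> List[MatchRecord]:
    """Parse whitespace-separated result lines into MatchRecord objects."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (4, 5):
            raise ValueError(
                f"line {line_no}: expected 4 or 5 fields, got {len(fields)}"
            )
        probe_id, rank, gallery_id, flag = (int(f) for f in fields[:4])
        if flag not in (0, 1):
            raise ValueError(f"line {line_no}: flag must be 0 or 1")
        # The score was optional on the August 1994 test, so allow it
        # to be absent.
        score = float(fields[4]) if len(fields) == 5 else None
        records.append(
            MatchRecord(probe_id, rank, gallery_id, flag == 1, score)
        )
    return records


# Made-up sample lines, not data from the tests.
sample = """\
7 1 45 1 87.34
7 2 12 1 75.45

7 50 231 0
"""
records = parse_results(sample)
```

Treating the score as optional mirrors the August 1994 rule that reporting
the similarity measure was not required.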

No testing was done to determine how the algorithms would respond to a
face-like piece of clutter that might be forwarded to the recognition
algorithm from the face detection front-end. Tests of this nature will have
to wait until detection and recognition algorithms are interfaced together
in a full demonstration system.

4.5  Calculation  of  Scores 

The results for the FERET phase I and II tests are reported by two sets of
performance statistics. One is the cumulative matched versus rank
(cumulative match) and the other is the receiver operator curve (ROC). Both
scores are computed from the output files provided by the algorithm
developers (sect. 4.4). The selection of which score is computed depends on
the test and analysis being performed.

The performance results for the large gallery test and the rotation test
are reported by a graph of the cumulative match score. Performance scores
are reported for a number of subsets of the probe set. It is not possible
to compute the cumulative match score for the entire probe set, because the
probe set contains probes that are not in the gallery. For the large
gallery test, we report the cumulative match score for the subset of all
probes that have a corresponding match in the gallery and for all
categories listed in table 2 (sect. 4.3), except the FA versus FA category.
Probes not in the gallery are not counted towards the cumulative score.

Table 6. Example of a results file.

Probe ID   Rank   Matched gallery   False-alarm   Matching
number            ID number         flag          score
1          3      45                1             87.34
1          2      45                1             75.45
1          3      111               1             67.23
1          50     231               0             11.56

In the large gallery test, each algorithm reports the top 50 matches for
each probe, provided in a rank-ordered list (table 6). From this list one
can determine whether the correct answer for a particular probe is in the
top 50 and, if it is, how far down the list the correct match is. For
example, for probe 1, if the correct match is with gallery image 22, and
the match between probe 1 and gallery image 22 is ranked number 10 (the
algorithm being tested reports that there are nine other gallery images
that are better matches than gallery image 22), then we say that the
correct answer for probe 1 is rank 10.

For a probe set we can find for how many probes the correct answer is
ranked 5 or less. In the previous example, probe 1 would not be counted.
The figures in this report show the percentage of probes that are of a
particular rank or less. The horizontal axis is the rank, and the vertical
axis the percentage correct. For example, for the MIT curve in figure 4
(sect. 4.6), the first box indicates that the correct answer was rank 1 for
80 percent of the probes, the box at position 2 indicates that the correct
answer was rank 1 or 2 for ~82 percent of the probe images, that ~87
percent of the probes were of rank 10 or less, etc.

The following formula is used to compute scores for a given category. To
make the explanation concrete, we use the class of duplicate images in the
large gallery test. Let $P$ be a subset of probe images in the probe set;
e.g., $P$ is the set of duplicate images in the large gallery test for USC.
The number of images in $P$ is denoted by $|P|$; in this example $|P|$ is
50. Let $R_k$ be the number of probes in $P$ that are ranked $k$ or less;
e.g., if $k = 10$, then $R_{10} = 43$. Thus, the percentage of probes that
are rank $k$ or less is $R_k/|P|$, or in the example case,
$R_{10}/|P| = 43/50 = 0.86$ (fig. 6, sect. 4.6).
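
The cumulative match computation above can be sketched in Python as
follows. This is a minimal illustration, not the scoring code used for the
tests. It assumes each probe in the subset has already been reduced to the
rank of its correct gallery match, with None standing for a correct match
that does not appear in the reported list; the function names are
hypothetical.

```python
from typing import Iterable, List, Optional


def correct_rank(ranked_gallery_ids: List[int],
                 true_gallery_id: int) -> Optional[int]:
    """Return the 1-based rank of the correct gallery image, or None."""
    try:
        return ranked_gallery_ids.index(true_gallery_id) + 1
    except ValueError:
        return None  # correct match is not in the reported list


def cumulative_match(ranks: Iterable[Optional[int]],
                     max_rank: int) -> List[float]:
    """Return [R_1/|P|, ..., R_max_rank/|P|] for a probe subset P.

    `ranks` holds one entry per probe in P.
    """
    ranks = list(ranks)
    size = len(ranks)  # |P|
    if size == 0:
        raise ValueError("probe subset is empty")
    # counts[k] = number of probes whose correct answer is exactly rank k
    counts = [0] * (max_rank + 1)
    for r in ranks:
        if r is not None and 1 <= r <= max_rank:
            counts[r] += 1
    scores = []
    running = 0  # R_k, accumulated as k increases
    for k in range(1, max_rank + 1):
        running += counts[k]
        scores.append(running / size)
    return scores


# The example from the text: 43 of 50 probes have rank 10 or less.
# The individual ranks below are made up; only the counts match the text.
example_ranks = [1] * 43 + [None] * 7
example_scores = cumulative_match(example_ranks, max_rank=50)
```

Here example_scores[9] corresponds to $R_{10}/|P|$ and equals 0.86, in
agreement with the worked example.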

For the false-alarm test, an ROC is used to evaluate the algorithms. The
ROC allows one to assess the trade-off between the probability of false
alarm and the probability of correct identification. In the false-alarm
test, there are two primary categories of probes. The first are probes not
in the gallery that generate false alarms. A false alarm occurs when an
algorithm reports that one of these probes is in the gallery. The
false-alarm rate is the percentage of probes not in the gallery that are
falsely reported as being in the gallery. The false-alarm rate is denoted
by $P_F$. The second category of probes is the set that is in the gallery.
This set, characterized by the percentage of these probes that are
correctly identified, is denoted by $P_I$. The pair of values $P_I$ and
$P_F$ describe the operation of a system in an open universe; in an open
universe, not every probe is in the gallery.

There is a trade-off between $P_F$ and $P_I$. If every probe is tagged as a
false alarm, then $P_F = 0$ and $P_I = 0$. At the other extreme, if no
probes are declared to be false alarms, then $P_F = 1$ and $P_I$ is the
percentage of probes in the gallery with a rank 1. For an algorithm,
performance is not characterized by a single pair of statistics
$(P_I, P_F)$ but rather by all pairs $(P_I, P_F)$, and this set of values
is an ROC (see fig. 16, sect. 4.6.2: the horizontal axis is the false-alarm
rate and the vertical axis the probability of correct identification). From
the ROC it is possible to compare algorithms.

Say we are given algorithm A and algorithm B, along with a false-alarm
rate for each, $P_F^A$ and $P_F^B$, and a probability of correct
identification for each, $P_I^A$ and $P_I^B$. Algorithms A and B cannot be
compared from the performance points $(P_I^A, P_F^A)$ and
$(P_I^B, P_F^B)$. This is especially true if $(P_I^A, P_F^A)$ and
$(P_I^B, P_F^B)$ are not close in value. The two systems may be operating
at different points on the same ROC, or, for different values of $P_F$ or
$P_I$, one algorithm could have better performance.

For each $P_F$ or $P_I$, an optimal decision rule could be constructed to
maximize performance for the other parameter. For testing and evaluating
algorithms, it is not practical to construct an ROC in this manner, and an
approximation is used. For each probe, the algorithm reports the person in
the gallery with which the probe is most similar, along with a confidence
score. The test scorer obtains this information from the results file by
reading the information about the highest ranked gallery image. Assume that
a high confidence score implies greater likelihood that images are of the
same person. Apply a threshold to the confidence score. The algorithm
reports that the probe is not in the gallery if the confidence score is
below the threshold. If the match score is greater than or equal to the
threshold, then estimate the identity of the probe as the gallery image
with the highest confidence score. A false alarm is a probe whose match
score is greater than or equal to the threshold and is not in the gallery.
Let $F$ denote the number of false alarms. The probability of a false alarm
is $P_F = F/F^*$, where $F^*$ is the number of probes in the probe set that
are not in the gallery. A probe in the gallery is correctly identified if
the algorithm reports the correct identity, and the match score is greater
than or equal to the threshold. The probability of correct identification
is $P_I = I/I^*$, where $I$ is the number of probes correctly identified,
and $I^*$ is the number of probes in the probe set that are in the gallery.

We generated the ROC by varying the threshold and recomputing $P_F$ and
$P_I$ for each threshold. Initially, the threshold is set higher than the
highest match score. This will generate the point $P_F = 0$ and $P_I = 0$.
The threshold is incrementally lowered, and for each value, $P_F$ and $P_I$
are computed. The process of lowering the threshold will sweep out the ROC,
and $P_F$ and $P_I$ will monotonically increase.
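
The threshold sweep described above can be sketched in Python as follows.
This is an illustration under stated assumptions, not the scoring code used
for the tests. Each probe is assumed to be reduced to its top-ranked
gallery ID, the confidence score of that match, and the true gallery ID
(None for a probe not in the gallery); higher scores are taken to mean
greater similarity, as in the text. The name roc_points and the sample data
are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (top-ranked gallery ID, confidence score, true gallery ID or None)
TopMatch = Tuple[int, float, Optional[int]]


def roc_points(top_matches: Sequence[TopMatch]) -> List[Tuple[float, float]]:
    """Return (P_F, P_I) pairs obtained by lowering the score threshold."""
    n_out = sum(1 for _, _, truth in top_matches if truth is None)  # F*
    n_in = len(top_matches) - n_out                                 # I*
    if n_out == 0 or n_in == 0:
        raise ValueError("need probes both in and not in the gallery")

    # A threshold above the highest score gives the point (0, 0).
    points = [(0.0, 0.0)]
    # Lower the threshold through every distinct score, highest first.
    for threshold in sorted({s for _, s, _ in top_matches}, reverse=True):
        # F: probes not in the gallery whose score passes the threshold.
        false_alarms = sum(
            1 for _, s, truth in top_matches
            if truth is None and s >= threshold
        )
        # I: probes in the gallery with the correct top match that pass.
        correct = sum(
            1 for gid, s, truth in top_matches
            if truth is not None and gid == truth and s >= threshold
        )
        points.append((false_alarms / n_out, correct / n_in))
    return points


# Made-up example: three probes in the gallery, two not in the gallery.
sample = [
    (45, 0.9, 45),    # in gallery, correct top match
    (12, 0.8, None),  # not in gallery
    (7, 0.7, 7),      # in gallery, correct top match
    (3, 0.6, 9),      # in gallery, wrong top match
    (5, 0.5, None),   # not in gallery
]
points = roc_points(sample)
```

Stepping the threshold through the distinct reported scores is one simple
way to lower it incrementally. At the lowest threshold no probe is rejected,
so $P_F = 1$ and $P_I$ is the fraction of in-gallery probes whose correct
match is rank 1.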

4.6 Results*

4.6.1  Large  Gallery  Test  Performance 

The results for the large gallery test are reported as cumulative match
versus rank. Scores are presented for overall performance and for a number
of different categories of probe images. Table 7 shows the categories
corresponding to the figures presenting these results (fig. 4 to 15).

Table 7. Figures reporting results for large gallery test.

Figure no.  Category title            Description of category
4           Adjusted overall match    Score for all probes in gallery,
                                      adjusted for 180 probes placed by
                                      mistake in probe set.
5           Unadjusted overall match  Score for all probes in gallery
                                      including 180 probes placed by
                                      mistake in probe set.
6           Duplicate match           Given a duplicate frontal image,
                                      find frontal match.
7           FA versus FB match        Given FB frontal image, find frontal
                                      match from same set.
8           Quarter match             Given quarter profile, find frontal
                                      match.
9           Half match                Given half profile, find frontal
                                      match.
10          10% scale match           Given an image reduced by 10%, find
                                      frontal match.
11          20% scale match           Given an image reduced by 20%, find
                                      frontal match.
12          30% scale match           Given an image reduced by 30%, find
                                      frontal match.
13          40% illumination match    Given an image with brightness
                                      reduced to 40%, find frontal match.
14          60% illumination match    Given an image with brightness
                                      reduced to 60%, find frontal match.
15a         Clothes change - dark     Given an image with clothing contrast
                                      changed darker than original, find
                                      match.
15b         Clothes change - light    Given an image with clothing contrast
                                      changed lighter than original, find
                                      match.

*Results are presented only for contractors whose funding was continued
into Phase II.

Figure 4 reports overall performance, where the probe set consisted of all
probes for which there was a gallery image of the person in the probe. This
includes the FA, FB, duplicate, rotation, and electronically altered
images. The figure indicates the number of probe images scored for this
category: e.g., for MIT there were 770 probes in the overall category, and
for TASC there were 378 probes. This information is provided for all the
figures. All scores in figures 4 and 6 to 15 were adjusted to take into
account an error in the construction of the test set: 180 images that did
not meet the requirements for the phase I effort were mistakenly included
in the gallery and had to be removed from all the scored results; in these
images, the face took up much less of the field of view than had been
specified. The annotation "adjusted" in the figures indicates that the
scores were adjusted for this reason. However, MIT and USC voluntarily took
the test with these more difficult images. Figure 5 shows a comparison of
the overall performance on the uncorrected set of images, along with that
for the adjusted set of probes. Figure 6 shows the performance on the
duplicate frontal images. These scores are also adjusted for images that
were unreadable because of computer media damage. Figure 7 shows the
performance on the FB frontal images.

Figures  8  to  15  show  performance  for  each  of  the  remaining  categories 
from  table  2,  except  for  the  FA  images  and  probes  that  are  not  in  the 
gallery. 

4.6.2  False-Alarm  Test  Performance 

Figure 16 shows the ROC generated from the false-alarm test. We adjusted
these values also to remove images that were unreadable because of computer
media damage. We report only overall performance results for the entire
probe set.

4.6.3  Rotated  Gallery  Test  Performance 

Figure 17 shows the results for the test examining the algorithms'
robustness under nonfrontal images in the gallery (also adjusted to omit
unreadable images). We report only overall performance results for the
entire probe set.


Figure 4. Large gallery test: overall scores, adjusted (August 1994).

Figure 5. Large gallery test: overall scores: full set versus corrected set (August 1994).

Figure 6. Large gallery test: duplicate scores: adjusted (August 1994).

Figure 8. Large gallery test: quarter profile scores, adjusted (August 1994).

Figure 9. Large gallery test: half profile scores, adjusted (August 1994).

Figure 10. Large gallery test: 10% scale reduction scores, adjusted (August 1994).

Figure 11. Large gallery test: 20% scale reduction scores, adjusted (August 1994).

Figure 12. Large gallery test: 30% scale reduction scores, adjusted (August 1994).

Figure 13. Large gallery test: 40% of illumination scores, adjusted (August 1994).

Figure 14. Large gallery test: 60% of illumination scores, adjusted (August 1994).

Figure 15. Large gallery test: (a) clothing color darkened scores, adjusted; (b) clothing
color lightened scores, adjusted (August 1994).

Figure 16. False-alarm test: ROC (August 1994).

4.7 Analysis

Performance of the algorithms falls roughly into three categories. In the
first category are the algorithms of MIT and USC; both these algorithms
perform comparably on the large gallery test and on the false-alarm test.
The second category consists of the TASC algorithm, and the third category
is the Rutgers algorithm. As a rule there is a noticeable difference in
performance between each category. It is harder to draw definite
conclusions about performance within the category, because there is no
estimate of the variance of the recognition scores; e.g., we do not know
how the performance score would change if we moved the FB images to the
gallery and the FA images to the probe set.

The graphs show that the MIT, USC, and TASC approaches consistently
outperform the Rutgers approach. The testing sets for TASC are different
from the others, so the TASC results can be compared only roughly; an exact
comparison was not possible from these test results, because of the need
for different test sets.

Comparison of figures 4 and 8 shows that the Rutgers and MIT algorithms
are very sensitive to changes in profile, particularly MIT. The USC
algorithm maintains high performance for quarter-profile images, but
performance drops considerably for half profiles (fig. 9). Most of the
algorithms show little if any degradation under scale reduction up to 30
percent (fig. 10 to 12). USC and TASC show greater sensitivity to
illumination than the other algorithms (fig. 13 and 14). Examination of
figure 5 shows that the mistakenly included gallery images are indeed
harder to use, as both the MIT and USC algorithms show an 8 to 9 percent
drop in performance when these images are included in the gallery.

The  false-alarm  test  (fig.  16)  shows  the  same  breakout  in  performance 
groups  as  the  large  gallery  test:  MIT  and  USC  are  comparable  across  the 
entire  ROC,  and  they  outperform  Rutgers. 

The rotation test confirms the finding from the large gallery test that
rotation is a hard problem and was beyond the scope of phase I of the FERET
program. On the rotation test, MIT and Rutgers had comparable performance
and outperformed USC. This is in contrast to the large gallery test, where
USC outperformed MIT and Rutgers on the rotation categories.

The  conclusion  drawn  from  the  phase  I  test  was  that  the  next  step  in  the 
development  of  face  recognition  algorithms  was  to  concentrate  on  larger 
galleries  and  on  recognizing  faces  in  duplicate  images.  The  large  gallery 
test  established  a  baseline  for  algorithm  performance.  The  algorithms 
tested  demonstrated  a  level  of  maturity  that  allows  them  to  automatically 
process  a  gallery  of  316  images  and  a  probe  set  of  770  images.  The  results 
on  all  categories  of  probes  were  well  above  chance,  and  the  algorithms 
demonstrated  various  degrees  of  invariance  to  changes  in  illumination, 
scale,  and  clothing  color. 

The decision to concentrate on larger galleries and duplicates was driven
by real-world considerations. All applications require algorithms to
recognize people from images taken on different days, and many users
require the algorithms to work on databases of over 10,000 individuals. The
other hard problem identified by the test was recognizing faces when the
probe and gallery image have different poses. It was decided to delay
working on this problem to avoid spreading the research effort too thinly.
Also, solving the duplicate problem is a prerequisite to the rotation
problem. Real-world applications will use rotated images taken at different
times.


5.  Phase  II 


In phase II, TASC, MIT, and USC continued development of their approaches.
The MIT and USC teams continued work on developing face recognition
algorithms from still images. The TASC effort switched to developing an
algorithm for recognizing faces from video. The emphasis was to estimate
the three-dimensional shape of the face from motion and recognize the face
based on its shape. In phase II, Rutgers performed a study comparing and
assessing the relative merits of long-wave infrared images and visible
images for face recognition and detection. Their results are not reported
here. Since the Rutgers and TASC efforts pursued different avenues, it was
not appropriate for their algorithms to take the phase II test.

Phase I of the FERET program established a baseline for face recognition
algorithms; the goal of phase II was to improve the performance of the
algorithms to the point that they could be ported to a real-time
experimental/demonstration system. An experimental/demonstration system
would enable one to collect performance statistics over a longer time
period than is possible with a laboratory test.

One of the conclusions from the phase I test was that greater improvement
was needed in the ability of algorithms to recognize faces when the probe
and gallery images were taken weeks, months, or years apart (duplicate
images). Another major concern was how algorithm performance would scale as
the size of the gallery increased. In phase II, both the MIT and USC teams
concentrated on these two issues. As a measure of progress, both MIT and
USC took the March 1995 phase II FERET test. The data collection activities
in phase II were designed to support the March 1995 test.

The March 1995 test consisted of a single test, an enlarged version of the
large gallery test of August 1994. The main difference is that the gallery
consisted of 831 individuals, and there were 463 duplicate images in the
probe set. The designation of the fa or fb frontal image as FA was
determined randomly. Only 780 out of the 831 FB images were placed in the
probe set. The breakout of the images in the test is given in table 8.

The  testing  procedure  for  March  1995  was  the  same  as  for  the  August  1994 
test.  The  test  was  administered  at  MIT  on  1  to  2  March  1995  and  at  USC  on 
6  to  8  March  1995.  The  time  limit  for  taking  the  test  was  three  days. 

In phase II, the MIT team developed two versions of their face recognition
algorithm. In the "original" version, the feature locator module passed the
top location for each feature to the identification module, and in the
"hierarchical" version, the top three locations were passed to the
identification module. Both versions of the algorithm were tested.

5.1 Results


The contractors were requested to supply the test results in the same
format as the earlier phase I test, as shown in table 6, except that the
ranked list was to include the top 100 matches instead of the top 50.

The scoring protocol for this test is the same as the large gallery test
from phase I, and the results are scored and reported in the same manner.
Table 9 shows the categories of images corresponding to the figures
presenting the results (fig. 18 to 28).


Table 8. Number and types of images used in March 1995 test.

Image category                              Number
Gallery images:
  FA frontal images                            831
Probe images:
  FA frontal images (fa)                        71
  FB frontal images                            780
  Probes not in gallery (frontal images)        45
  Duplicate frontal images                     463
  Quarter rotations                             33
  Half rotations                                48
  40% change in illumination                    40
  60% change in illumination                    40
  10% reduction in scale                        40
  20% reduction in scale                        40
  30% reduction in scale                        40
  Contrast-reversed clothes                     40
  Total probes                                1680

Table 9. Figures reporting results for March 1995 test.

Figure no.  Category title          Description of category
18          Overall match           Given any probe aspect, find correct
                                    ID.
19          FA versus FB match      Match FB frontal images from same set.
20          Duplicate match         Match frontals collected on different
                                    dates.
21          Quarter match           Given quarter profile, find frontal
                                    match.
22          Half match              Given half profile, find frontal
                                    match.
23          60% illumination match  Given an image with brightness reduced
                                    to 60%, find frontal match.
24          40% illumination match  Given an image with brightness reduced
                                    to 40%, find frontal match.
25          10% scale match         Given an image reduced by 10%, find
                                    frontal match.
26          20% scale match         Given an image reduced by 20%, find
                                    frontal match.
27          30% scale match         Given an image reduced by 30%, find
                                    frontal match.
28          Clothes change          Given an image with clothes contrast
                                    changed, find match.

Figure 23. Large gallery test: 60% original illumination (March 1995).

Figure 24. Large gallery test: 40% original illumination (March 1995).

Figure 25. Large gallery test: 10% reduced image size (March 1995).

Figure 28. Large gallery test: clothes contrast change (March 1995).

Analysis


Analysis of figure 18 shows that the USC and the two MIT algorithms
performed well on the test set, with the USC algorithm showing slightly
better results. Figure 19 shows that for frontal images taken on the same
date, the algorithms give virtually identical results. All the algorithms
show a marked decrease in performance when the test images were taken on
different dates from those of the gallery images (fig. 20), with the MIT
algorithms showing a greater decrease in performance. Figures 21 and 22
show that all the algorithms are still sensitive to the angle of the face
to be recognized, especially the MIT algorithms. The MIT algorithms show
almost no decrease in performance due to reduced illumination (fig. 23 to
24). The USC algorithm exhibits degraded performance after illumination is
reduced to 40 percent of original. All the algorithms demonstrate
insensitivity to reduced image size up to 30 percent (fig. 25 to 27). The
algorithms were not "tested to failure" by continual reductions in image
size, because the research groups were told that variations in scale would
not exceed a factor of two. The algorithms also do not degrade
significantly when the clothes contrast changes (fig. 28), suggesting that
the algorithms have been successful in using the face features for
recognition.

The MIT modification for hierarchical searching for features has little
impact on the recognition of probe images if the image is frontal face, as
can be seen in figures 18 to 20 and 23 to 26. It did improve the
performance slightly on images with the largest scale change (fig. 27). The
most notable difference in performance between the hierarchical approach
and the standard approach can be seen in the rotated images (fig. 21 and
22). The hierarchical approach shows a significant improvement in
performance on the quarter-profile images and a modest decrease in
performance on the half-profile images. This indicates that the
hierarchical approach does improve robustness on images where the face is
not full frontal but most of the face is presented. The loss of performance
on the half-profile images may be due to difficulties in locating the eye
farthest from the camera: notice in figure 3 the differences between the ql
and hl and between the qr and hr images. Only in the quarter images can
both eyes be fully seen.

As  a  means  of  assessing  the  effect  of  gallery  size  on  performance,  the  MIT 
standard  and  algorithm  was  tested  on  a  series  of  galleries  of  increasing 
size: the graduated gallery study. Gallery sizes of 100, 200, 400, 600, and
831 were used by the MIT team to test the capacity versus performance of
their system. Figures 29 to 34 show the size of the gallery and number of
probes scored. These galleries were a subset of the original 831-person
gallery, and for each run of this experiment, the original probe set of
1680 was used. In computing the scores, the appropriate subset of probes
was used: i.e., in the gallery of 100 people, the FA versus FB results
involved only FB images in the probe set that were in this gallery.
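
The probe selection used in the graduated gallery study can be sketched in
Python as follows. This is an illustration only. It assumes each probe
carries a category label and the identity of the person shown (None if that
person is not in the full gallery); the names Probe and scorable_probes,
the category labels, and the sample data are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class Probe(NamedTuple):
    probe_id: int
    category: str             # e.g. "FB" or "duplicate" (illustrative)
    person_id: Optional[int]  # None if the person is not in the gallery


def scorable_probes(probes: Iterable[Probe],
                    gallery_people: Set[int],
                    category: Optional[str] = None) -> List[Probe]:
    """Keep probes whose person is in the reduced gallery.

    If `category` is given, also restrict to that probe category.
    """
    return [
        p for p in probes
        if p.person_id in gallery_people
        and (category is None or p.category == category)
    ]


# Made-up probe set and a reduced gallery containing only person 10.
all_probes = [
    Probe(1, "FB", 10),
    Probe(2, "FB", 20),
    Probe(3, "duplicate", 10),
    Probe(4, "FB", None),
]
fb_subset = scorable_probes(all_probes, {10}, category="FB")
```

Only the probes returned by such a filter would enter the cumulative match
score for a given gallery size.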

Figures  29  through  34  show  the  MIT  algorithm's  performance  for  overall, 
duplicate,  and  FB  images  with  galleries  of  increasing  size.  These  figures 
show  the  expected  decline  in  performance  as  the  gallery  becomes  larger. 
Figures  31  and  34  show  that  for  duplicates  (frontal  images  taken  on  a  dif¬ 
ferent  date  from  that  of  the  gallery  image),  going  from  a  gallery  of  100  indi¬ 
viduals  to  one  of  831  individuals  causes  more  than  a  10-percent  reduction 
in performance.

Figure 29. Graduated gallery study: overall scores (March 1995).

Figure 30. Graduated gallery study: FA versus FB scores (March 1995).

Figure 31. Graduated gallery study: duplicate scores (March 1995).

Figure 32. Graduated gallery study: overall scores (March 1995).

Figure 33. Graduated gallery study: FA versus FB scores (March 1995).

Figure 34. Graduated gallery study: duplicate scores (March 1995).

6. Comparison of August 1994 and March 1995 Test Performance

The  principal  objective  for  the  August  1994  test  was  to  evaluate  each  algo¬ 
rithm  against  a  common  baseline  so  that  we  could  quantitatively  measure 
each  algorithm's  performance  and  compare  it  to  other  algorithms  on  a 
common  test  set.  In  addition,  during  Phase  I,  we  evaluated  each  algorithm 
to  determine  its  potential  for  solving  or  at  least  contributing  to  solving  the 
more  complex  face  recognition  problems  of  the  future.  Finally,  the  overall 
results  of  Phase  I  were  considered  in  the  selection  of  three  research  groups 
to  continue  algorithm  development  (out  of  the  original  five). 

In  contrast,  the  principal  objectives  of  the  March  1995  evaluation  were  to 
assess  the  maturity  of  the  two  algorithms  tested  and  to  determine  if  either 
or  both  were  mature  enough  to  be  used  in  a  demonstration  system.  This 
included  testing  against  a  more  demanding  and  difficult  test,  including  a 
larger  database  and  more  duplicate  images.  In  addition,  the  March  1995 
test was used to measure the performance improvements of recent
modifications to both algorithms. Although the performance numbers decreased,
the  actual  performance  of  both  algorithms  was  judged  to  have  improved, 
because  they  were  successful  despite  increases  in  the  number  of  images,  in 
the  number  of  duplicates,  and  in  the  difficulty  of  the  test.  Because  of  these 
factors,  any  comparison  of  the  August  1994  and  March  1995  results  is  very 
difficult. 

However,  one  test  in  particular  can  be  compared.  The  FA  versus  FB  test, 
which  identifies  the  alternative  frontal  images  from  the  same  collection 
date,  is  not  affected  by  the  presence  of  duplicate  images.  It  is,  therefore, 
reasonable  to  compare  these  test  results.  The  March  1995  testing  provides 
greater  insight  into  the  effects  of  an  increased  database  as  reflected  by  the 
increased  gallery  size.  Figure  35  shows  that  the  absolute  performance  in¬ 
creased  as  the  gallery  size  increased  for  the  USC  algorithm,  but  no  signifi¬ 
cant  change  was  observed  for  the  MIT  standard  algorithm. 

One  of  the  primary  investigations  of  the  March  1995  test  studied  the  effect 
of  duplicate  images  on  performance.  This  test  was  of  key  importance  to  the 
FERET  program  and  is  also  one  of  the  most  difficult  problems  to  be  ad¬ 
dressed  by  any  face  recognition  algorithm.  The  March  1995  test  provided  a 
10x increase in duplicates and a 2.5x increase in gallery size over the
August  1994  test. 

Comparing  the  effects  of  duplicate  images  on  the  August  1994  and  March 
1995  test  results,  we  determined  that  the  correct  recognition  of  individuals 
had  declined,  in  the  absolute  sense.  However,  the  March  1995  test  pro¬ 
vided  a  more  stringent  evaluation  of  each  algorithm's  performance  by  pro¬ 
viding  a  more  robust  and  diverse  database.  Therefore,  we  view  the  decline 
in  performance  as  minimal,  given  the  nature  of  the  problem  and  the  sig¬ 
nificant  increase  in  the  number  of  duplicate  images  used  in  testing.  This 
result,  combined  with  comparable  FA  versus  FB  scores  against  a  larger 
gallery,  leads  us  to  conclude  that  the  MIT  and  USC  algorithms  performed 
better in the March 1995 test.

7. Tests on Algorithms Outside FERET Program

At  the  time  of  this  report,  only  one  other  organization  had  submitted  an  al¬ 
gorithm  for  government  testing.  Joseph  Atick,  head  of  the  Laboratory  of 
Computational  Neuroscience  at  Rockefeller  University,  New  York,  re¬ 
quested  a  government  test  of  the  Rockefeller  algorithm.  This  algorithm 
was  tested  with  the  large  gallery  test  of  March  1995  and  the  false-alarm  test 
of  August  1994  at  the  Rockefeller  site  on  6  to  8  November  1995,  under  the 
same  constraints  as  the  previous  tests.  This  report  contains  no  information 
on  the  algorithmic  approach,  as  these  details  were  not  revealed  to  us. 

The Rockefeller algorithm performs quite well. Figures 36 to 39 show the
Rockefeller results plotted with the MIT and USC results from the Phase II
test. The algorithm performs significantly better than any other tested
algorithm on the quarter-rotated images (fig. 39). Figures 40 to 45 show the
Rockefeller algorithm performance under the remaining test conditions. It
performs comparably to the USC and MIT algorithms under these conditions.

In  addition,  the  Rockefeller  algorithm  took  the  false-alarm  test  from  Phase 
I.  Figure  46  shows  the  results  for  Rockefeller  along  with  the  MIT  and  USC 
results.  Note  that  the  USC  and  MIT  results  are  from  August  1994,  as  a 
false-alarm  test  was  not  included  in  the  March  1995  test. 


It  is  anticipated  that  other  algorithms  will  be  submitted  for  testing  in  the 
future.  Results  from  these  tests  will  be  published  under  separate  covers  as 
the  need  arises. 


Figure 36. Large gallery tests: overall scores (November 1995).

Figure 37. Large gallery tests: FA versus FB scores (November 1995).

Figure 38. Large gallery tests: duplicate scores (November 1995).

Figure 40. Large gallery tests: half rotation scores (November 1995).

Figure 46. False-alarm test comparison.

Under the sponsorship of DARPA, ARL is conducting the algorithm
development and facial database development portions of the FERET program.
This  program  addresses  the  complex  issues  of  facial  recognition  that  have 
direct  and  daily  applications  to  the  intelligence  and  law  enforcement  com¬ 
munities.  The  FERET  program  is  currently  investigating  techniques  and 
technologies  that  show  significant  promise  in  the  area  of  face  recognition. 
The  long-term  goal  of  the  FERET  program  is  to  transition  one  or  more  of 
these  algorithms  into  a  fieldable  face  recognition  system. 

Face  recognition  is  a  very  difficult  problem  that  is  further  complicated  by 
the  fact  that  there  are  billions  of  people  in  the  world,  but  researchers  have 
images  of  only  a  few  thousand  individuals  and  only  a  small  number  of  im¬ 
ages  for  each  individual.  To  a  human  observer,  the  large  number  of  varia¬ 
tions  in  personal  appearance  that  occur  naturally  appear  normal,  but  for 
the  developers  of  face  recognition  algorithms,  these  produce  large  discrep¬ 
ancies  and,  therefore,  problems  for  the  algorithms.  It  is  this  overall  prob¬ 
lem  of  facial  recognition  that  the  FERET  program  is  addressing. 

The  basic  goal  of  the  Phase  I  test  was  to  baseline  algorithm  performance  on 
a  known  database  so  that  we  can  gauge  performance  and  understand  the 
technical  roadblocks  to  a  viable,  fielded  system.  Before  the  FERET  pro¬ 
gram,  most  research  efforts  that  addressed  the  issue  of  facial  recognition 
used  database  images  that  were  carefully  registered  when  collected.  Since 
the  FERET  database  was  collected  to  address  a  real-world  problem,  it  was 
created  to  be  more  realistic,  although  still  providing  some  control  over  the 
type  and  nature  of  the  images  collected. 

In  support  of  the  Phase  I  test,  a  database  of  over  5000  images  was  collected. 
This  required  numerous  collection  activities  and  a  large-scale  effort  to 
catalogue  the  images  into  a  database.  This  database  has  been  requested  by 
and  distributed  to  at  least  50  different  research  groups,  greatly  assisting 
researchers  in  the  development  and  performance  evaluation  of  their 
algorithms. 

The  first  phase  of  the  FERET  program,  which  included  the  August  1994 
test  and  evaluation  effort,  was  judged  to  be  very  successful.  Accomplish¬ 
ments  during  Phase  I  included  the  following: 

1.  For  the  first  time  in  face-recognition  development,  the  performance  of  sev¬ 
eral  algorithms  was  established  against  a  common  baseline. 

2.  The  state  of  the  art  was  significantly  advanced  in  the  area  of  face  recogni¬ 
tion.  At  the  start  of  the  program,  algorithms  worked  on  either  a  small  data¬ 
base  or  on  databases  of  images  collected  under  highly  controlled  condi¬ 
tions.  At  the  end  of  Phase  I,  algorithms  were  working  with  databases  of  up 
to  500  individuals  collected  under  semi-controlled  conditions. 

3.  A  database  of  facial  images  was  established  that  models  real-world 
conditions. 


4.  Areas  for  future  research  were  identified: 

•  Increase  the  size  of  the  database. 

•  Increase  the  number  of  duplicate  images  (images  of  the  same  person 
taken  at  different  times). 

Partly  based  on  the  results  of  the  first  phase  of  the  FERET  program,  MIT, 
TASC,  and  USC  were  chosen  to  continue  their  research  efforts  in  Phase  II. 
Accomplishments  during  Phase  II  included  the  following: 

1.  Face  recognition  algorithms  were  developed  that  were  sufficiently  mature 
that  they  can  be  ported  to  real-time  experimental/demonstration  systems. 

2.  The  size  of  the  FERET  database  was  increased  to  1109  sets  of  images  and 
8525  images.  This  included  225  duplicate  sets. 

3.  TASC  proceeded  with  developing  algorithms  to  extract  shape  from  motion 
in  video  sequences. 

From  the  results  of  the  Phase  II  test,  we  concluded  that  the  overall 
performance of face recognition algorithms had reached a level of
maturity at which they should be ported to a real-time experimental/demonstration
system.  The  goals  of  this  system  will  be  to 

1.  develop  large-scale  performance  statistics  (this  requires  long  runs  over  a 
period  of  weeks  or  months  in  a  controlled  real-world  scenario;  an  example 
is  detecting  and  recognizing  people  as  they  walk  through  a  door  or  portal); 

2.  demonstrate  the  capabilities  of  the  system  to  potential  end  users;  and 

3.  identify  weaknesses  that  cannot  be  determined  in  laboratory  development 
efforts  or  represented  in  databases  collected  under  the  current  image  ac¬ 
quisition  protocol. 

In  the  future,  ARL  will  continue  to  address  the  research  being  conducted 
by  assisting  in  the  development  of  a  larger  and  more  varied  facial  data¬ 
base,  testing  and  evaluating  new  face  recognition  algorithms  being  devel¬ 
oped,  supporting  algorithm  research  and  development,  and  establishing 
baselines  for  human  performance. 

Future  research  into  facial  recognition  will  require  tests  that  are  more  ro¬ 
bust  in  design  and  content.  Tests  relating  to  various  hair  styles,  the  wear¬ 
ing  of  glasses,  increased  variation  in  rotational  angle,  and  inclination/ 
declination  of  the  face  are  only  a  few  of  the  areas  where  future  research  is 
needed.  Future  test  designs  will  require  larger  databases  consisting  of  im¬ 
ages  having  a  larger  range  of  human  variability,  such  as  that  obtained  over 
many  weeks  of  observation.  Future  areas  of  growth  in  the  collection  of  da¬ 
tabase  images  will  include 

1.  images  of  individuals  taken  over  an  extended  period  of  time, 

2. images with a variety of features (e.g., glasses, facial hair, disguises, etc.),

3. images of faces at different rotational poses,

4. images with various vertical head positions (inclination and declination of
head up to 4°), and

5.  video  sequences  with  subjects  moving  through  the  field  of  view. 

The  performance  of  face  recognition  algorithms  will  probably  continue  to 
improve.  This  was  reflected  when  MIT  retook  the  March  1995  test  in 
August 1996. The results are presented in appendix A.

Appendix A. Further Testing at MIT

The development of face recognition algorithms is a dynamic process;
today's performance statistics soon become outdated, as old algorithms
are improved and new ones developed. After the March 1995 test, the
Massachusetts Institute of Technology (MIT) Media Laboratory group
continued development of their algorithm and asked to retake the March 1995
test with the new algorithm.¹ The request was granted, and on 13 August
1996, the test was administered.


To support further research in face recognition, after the groups took the
March 1995 test, the Army Research Laboratory (ARL) released additional
images to those groups. The performance in this appendix reflects the MIT
group's  use  of  these  additional  data  in  developing  the  algorithm,  and  the 
results  are  compared  only  with  the  results  obtained  with  the  MIT  algo¬ 
rithm  tested  in  March  1995.  Figures  A-1  to  A-3  compare  the  performance 
of  the  March  1995  and  August  1996  algorithms:  overall  scores,  scores  on 
FA  versus  FB  images  (alternative  frontal  images),  and  scores  on  duplicate 
images. 


Figure A-1. Comparison of overall scores for March 1995 and August 1996
algorithms.


The results show a substantial improvement on the duplicate images and
reflect a concerted effort to develop algorithms to address the issue of
duplicate images. Similar increases in performance can be reasonably
expected for all approaches tested. Currently, there is no definitive set of
performance statistics, because upper limits on the ability of algorithms to
recognize faces have not been established.


¹B. Moghaddam, C. Nastar, and A. Pentland, Bayesian face recognition using deformable intensity surfaces. In
Proceedings of Computer Vision and Pattern Recognition 96, pp 638-645, 1996.

Figure A-2. Comparison of FA versus FB (alternative frontal images) scores for March
1995 and August 1996 algorithms.

Appendix B. Availability of Data for Outside Research

To  advance  the  state  of  the  art  in  face  recognition,  the  Army  Research 
Laboratory  (ARL)  will  make  the  Face  Recognition  Technology  (FERET) 
database  available  to  researchers  in  face  recognition  on  a  case  by  case  ba¬ 
sis.  All  requests  for  the  FERET  database  must  be  submitted  in  writing  to 
the  FERET  technical  agent  at  ARL.  Inquiries  for  further  information  may 
be  made  to  the  Program  Manager  at 

U.S.  Army  Research  Laboratory 
Dr.  P.  Jonathon  Phillips 
AMSRL-SE-RT 
2800  Powder  Mill  Rd 
Adelphi,  MD  20783-1197 

Phone:  301-394-5000 
e-mail: jonathon@arl.mil

Appendix C. Research Release Form

George  Mason  University  is  conducting  research  on  automated  means  for 
face  recognition.  The  subjects  are  expected  to  allow  their  pictures  to  be 
taken  in  five  poses:  frontal,  3/4  view,  and/or  profile.  Participation  in  this 
research  is  voluntary.  Full  confidentiality  will  be  maintained  regarding  the 
identity  of  the  subject,  and  coding  for  person-identifiable  data  will  be  done 
with  alphanumeric  tags.  This  project  has  been  reviewed  according  to 
George  Mason  University  procedures  governing  your  participation  in  this 
research.  You  may  also  contact  the  George  Mason  University  Office  for 
Research  at  703-993-2295  if  you  have  any  questions  or  comments  regard¬ 
ing  your  rights  as  a  participant  in  this  research. 

I understand that these pictures may be published in reports documenting
the  results  of  this  research. 

I  have  read  this  form  and  agree  to  participate  in  the  study. 

Date:  _ 

Subject  signature:  _ 

Witness: _

Appendix D. Algorithm Approaches

D-1. MIT Approach

The  Massachusetts  Institute  of  Technology  (MIT)  Media  Laboratory  Face 
Processing  system  consists  of  a  two-stage  object  detection  and  alignment 
stage,  a  contrast  normalization  stage,  and  a  feature  extraction  stage  whose 
output  is  used  both  for  the  recognition  stage  and  for  coding  the  gallery. 
Object  detection  begins  by  locating  regions  in  the  image  that  have  a  high 
likelihood  of  containing  a  face.  It  assumes  that  there  is  a  3:1  ratio  of  pos¬ 
sible  face  scales  (e.g.,  that  people  are  between  x  and  3x  distance  from  the 
camera).  Currently  four  independent  and  parallel  processors  are  used,  one 
designed  for  each  of  the  four  standard  poses  (frontal,  quarter,  half,  and  full 
profile).  This  head  localization  is  performed  by  multiscale  saliency  compu¬ 
tation.  In  addition  to  the  saliency  computation  based  on  likelihood,  the 
current  version  incorporates  likelihoods  based  on  the  first  two  moments  of 
the  grayscale  histogram  (mean  and  variance),  as  well  as  spatial  location. 
Each  of  these  factors  is  incorporated  independently  through  the  Mahala- 
nobis  distances  based  on  previously  computed  means  and  covariances 
from  training  data.  After  the  best  head  location  and  scale  are  determined, 
the  original  image  is  linearly  scaled  and  translated  so  that  the  head  is  cen¬ 
tered  in  the  frame  at  a  fixed  scale. 

Once  the  head-centered  image  is  obtained,  parallel  searches  for  the  four  fa¬ 
cial  features  (the  left  eye,  right  eye,  nose,  and  mouth)  are  conducted  in  es¬ 
sentially  the  same  manner  as  that  for  locating  the  head.  The  saliency 
computation  is  restricted  to  certain  regions  (windows)  in  the  head- 
centered  frame  and  is  also  modulated  by  a  prior  probability  distribution 
for  the  location  of  the  features  in  these  windows.  The  top  N  candidate  loca¬ 
tions  for  each  feature  are  verified  and  pruned  of  false  alarms  based  on  the 
geometrical  constraints  of  a  face  (the  relative  location  of  the  individual  fea¬ 
tures).  An  exhaustive  combinatorial  search  of  all  possible  pairings  of  the 
top  N  candidates  for  the  four  features  is  performed.  For  each  possible  com¬ 
bination  (which  forms  a  candidate  four-node  spatial  graph),  a  likelihood 
score  is  generated  based  on  a  Mahalanobis  distance,  in  terms  of  a 
12-dimensional  feature  vector,  which  consists  of  the  length  and  orientation 
of  the  six  links  of  this  graph.  The  individual  scores  (likelihoods)  of  each 
candidate  location  are  also  taken  into  consideration.  The  final  score  is  the 
product  of  these  four  individual  likelihoods  and  the  likelihood  score  from 
their  geometry. 
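
The geometric scoring step above can be illustrated with a short sketch. It is not the MIT implementation: the diagonal covariance, the unnormalized Gaussian likelihood, and all function names are simplifying assumptions made here, and orientation wrap-around is ignored. It builds the 12-dimensional vector (length and orientation of the six links joining four feature points), scores every combination of candidate locations, and keeps the best one.

```python
import math
from itertools import combinations, product


def link_features(points):
    """12-dimensional vector: (length, orientation) of the six links among four points."""
    assert len(points) == 4
    features = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        dx, dy = x2 - x1, y2 - y1
        features.append(math.hypot(dx, dy))   # link length
        features.append(math.atan2(dy, dx))   # link orientation in radians
    return features


def mahalanobis_sq_diag(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance (an assumption)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def candidate_score(points, feature_likelihoods, mean, var):
    """Product of the four per-feature likelihoods and a geometry likelihood."""
    d2 = mahalanobis_sq_diag(link_features(points), mean, var)
    geometry_likelihood = math.exp(-0.5 * d2)  # unnormalized Gaussian
    return math.prod(feature_likelihoods) * geometry_likelihood


def best_combination(candidates, mean, var):
    """Exhaustive search; candidates holds four lists of (point, likelihood) pairs."""
    best = None
    for combo in product(*candidates):
        points = [p for p, _ in combo]
        score = candidate_score(points, [l for _, l in combo], mean, var)
        if best is None or score > best[0]:
            best = (score, points)
    return best
```

With N candidates per feature the search visits N^4 combinations, which stays small for the handful of top candidates the text describes.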

The  final  feature  locations  are  then  used  to  warp  the  head-centered  image 
so  as  to  align  the  detected  feature  locations  with  those  of  a  canonical 
model.  A  rigid  transform  is  used  based  on  the  locations  of  the  two  eyes  in 
the  image  with  those  in  the  canonical  model.  After  scaling  and  alignment, 
the  warped  image  is  masked  so  that  the  background  is  removed.  It  is  then 
normalized  by  linear  remapping  of  the  grayscale  to  a  specified  mean  and 
standard deviation.

Finally, the geometrically aligned and normalized image is projected onto
a  custom  set  of  eigenfaces,  producing  a  feature  vector  that  is  then  used  for 
recognition,  as  well  as  facial  image  coding  of  the  gallery  images. 
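
A minimal sketch of the last two steps, contrast normalization and projection onto a set of basis images, is given below. Images are treated as flat lists of gray values; the target mean and standard deviation, the tiny basis in the usage lines, and the function names are placeholders chosen for illustration, not values from the MIT system.

```python
import math


def normalize_contrast(pixels, target_mean=128.0, target_std=32.0):
    """Linearly remap gray values to a specified mean and standard deviation."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    if std == 0:
        return [target_mean] * n  # flat image: nothing to rescale
    return [target_mean + target_std * (p - mean) / std for p in pixels]


def project(pixels, mean_face, eigenfaces):
    """Feature vector: dot product of the mean-subtracted image with each eigenface."""
    centered = [p - m for p, m in zip(pixels, mean_face)]
    return [sum(c * e for c, e in zip(centered, face)) for face in eigenfaces]


# Two images that differ only by a linear change of gray scale
# normalize to (numerically) the same image.
dim = normalize_contrast([10, 20, 30, 40])
bright = normalize_contrast([25, 45, 65, 85])

# Projection onto two orthonormal basis vectors (hypothetical "eigenfaces").
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
coefficients = project([1, 2, 3, 4], [0, 0, 0, 0], basis)
```

Because the normalization removes any linear change of gray scale before projection, the resulting feature vector is unaffected by such changes.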

D-2.  Rutgers  University  Approach 

The  Rutgers  University  Center  for  Computer  Aids  for  Industrial  Produc¬ 
tivity  (CAIP)  face-recognition  system  possesses  three  attributes  that  distin¬ 
guish  it  from  other  approaches.  The  first  of  these  is  the  use  of  grayscale 
projections,  wherein  a  two-dimensional  image  of  a  face  is  compacted  into  a 
small  number  of  one-dimensional  signatures.  These  signatures  are  ob¬ 
tained  by  the  addition  of  the  grayscale  values  of  pixels  across  the  image  in 
a  direction  perpendicular  to  the  angle  of  the  signature;  e.g.,  horizontal  pro¬ 
jections  are  obtained  by  the  addition  of  pixels  across  rows,  and  vertical 
projections  are  obtained  by  the  addition  of  pixels  down  the  columns.  This 
initial  stage  of  data  reduction  greatly  reduces  the  complexity  of  the  subse¬ 
quent  processing  without  sacrificing  significant  amounts  of  information 
necessary  for  recognition.  Because  robustness  to  rotation  of  the  head  about 
the  vertical  axis  was  important,  three  signatures  are  used  as  a  source  of 
features  for  recognition:  the  horizontal  projection  on  the  original  image, 
the  horizontal  projection  of  the  image  electronically  rotated  7°  left  of  the 
center  of  the  face,  and  the  horizontal  projection  rotated  7°  to  the  right. 

The  second  attribute  is  transform  coding  of  the  grayscale  projections. 

Transform  coding  of  the  sampled  projections  decorrelates  the  data,  allows 
for  additional  data  reduction  (elimination  of  high  spatial  frequencies  and 
the  dc  term),  and  distributes  the  local  errors  (e.g.,  due  to  a  smile  or  frown) 
over  all  the  output  samples  in  the  transform  domain.  For  this  effort,  the 
discrete  cosine  transform  (DCT)  was  used.  It  provides  results  closely  ap¬ 
proaching  those  of  the  Karhunen-Loeve  transform  (the  eigenface  approach 
in  two-dimensional  (2D)  systems),  but  can  be  computed  with  a  fast 
algorithm. 
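
The first two attributes can be sketched as follows. The image is a list of rows of gray values; the DCT is written out directly (an unnormalized DCT-II) rather than with a fast algorithm, and the number of coefficients kept is an arbitrary choice made for this illustration, not a value from the CAIP system.

```python
import math


def horizontal_projection(image):
    """Sum gray values across each row: one value per row."""
    return [sum(row) for row in image]


def vertical_projection(image):
    """Sum gray values down each column: one value per column."""
    return [sum(column) for column in zip(*image)]


def dct2(signal):
    """Unnormalized DCT-II, computed directly in O(n^2)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def projection_features(signal, keep):
    """Drop the dc term (k = 0) and keep the next `keep` low-frequency coefficients."""
    return dct2(signal)[1:keep + 1]


image = [[1, 2], [3, 4]]
rows = horizontal_projection(image)   # [3, 7]
cols = vertical_projection(image)     # [4, 6]
features = projection_features([1, 2, 3, 4], keep=2)
```

Dropping the dc term makes the retained coefficients insensitive to a constant offset added to the projection, and truncating to low frequencies is the additional data reduction mentioned above.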

The  third  attribute  is  training  and  classifying  via  the  CAIP-developed 
Neural  Tree  Network  (NTN).  The  NTN  is  a  hierarchical  classifier  that  ef¬ 
fectively  combines  neural  networks  and  decision  trees.  It  can  be  imple¬ 
mented  cost-effectively  on  extremely  simple  hardware,  i.e.,  a  single  re¬ 
programmable  neuron. 

The  CAIP  system  was  designed  to  find  and  identify  people  standing  in 
front of a uniform, consistently illuminated background. The first step in
the process is to segment the person from the background by the
computation of an edge picture (the maximum of 0°, +45°, -45°, +90° gradients
followed by thresholding and morphological growing to fill in gaps). The
edge image was used to set all background pixels in the gray level image
to zero. The edge picture was also used to locate the top, left, and right
edges of the head. These boundaries established the limits for horizontal
and vertical projections. These projections are used to locate the eyes, nose,
and mouth. The locations of the eyes and mouth are then used to scale the
face to a standard size. The final side of the box around the face is
generated by heuristic techniques for finding the top of the forehead and the
chin. Projections are then computed within the region from the forehead to
the chin. DCTs are computed on the projections, and low-pass spatial
frequency components are selected as features for training and recognition.

The NTN classifier performs at least as well as any direct distance-based
classifier in finding the most likely candidate for recognition. However,
testing results were required to include a rank order of the 50 most likely
candidates for each face presented. It was more efficient during the tests to
compute a function of the $L_1$ norm of the distance between the test vector
and the training vectors for each member of the database ($D_{L_1}$). The
ranking metric is $1 - D_{L_1}/D_{L_1\max}$, where $D_{L_1\max}$ was the
$L_1$ norm of the distance
from  the  test  vector  to  the  most  distant  training  vector.  If  the  ranking  met¬ 
ric  is  below  a  given  threshold  (0.6),  the  test  vector  is  rejected  (as  not  be¬ 
longing  to  the  database)  if  its  distance  to  the  mean  vector  of  the  database  is 
greater  than  1.5  times  the  distance  of  the  outermost  member  of  the  data¬ 
base  to  the  mean  of  the  database.  The  metric  is  computed  for  the  feature 
vectors derived from each of the three projections (0°, ±7°), stored for each
member  of  the  database;  the  largest  is  selected  as  representing  the  distance 
to  that  member.  Then,  these  maximum  values  are  rank-ordered  across  the 
database. 
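
The ranking procedure can be sketched as follows, reading the norm as the L1 (sum of absolute differences) norm. The data layout and function names are hypothetical, and the rejection rule described above is left out. For each of the three projections the metric is one minus the distance divided by the largest distance to any training vector; the largest of the three metrics represents each member, and members are then sorted on that value.

```python
def l1_distance(a, b):
    """L1 norm of the difference between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_gallery(test_vectors, database):
    """Rank database members for one probe.

    test_vectors: feature vectors of the probe, one per projection
    database: dict member_id -> stored feature vectors in the same order
    Returns a list of (member_id, metric), best match first.
    """
    members = list(database)
    best = {m: float("-inf") for m in members}
    for p, test in enumerate(test_vectors):
        distances = {m: l1_distance(test, database[m][p]) for m in members}
        d_max = max(distances.values())
        for m in members:
            metric = 1.0 - distances[m] / d_max if d_max > 0 else 1.0
            best[m] = max(best[m], metric)  # keep the largest of the three metrics
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


database = {
    "a": [[0, 0], [0, 0], [0, 0]],
    "b": [[4, 0], [4, 0], [4, 0]],
}
ranking = rank_gallery([[1, 0], [0, 0], [2, 0]], database)
```

The metric lies between 0 (the most distant training vector) and 1 (an exact match), so truncating the sorted list to its first 50 entries would give the rank-ordered candidate list the tests required.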

D-3.  TASC  Approach 

The  major  emphasis  of  the  effort  by  The  Analytic  Science  Company 
(TASC) is the use of information about the 3D shape of the face to both
detect and compensate for viewing angle variation. Most approaches to face
recognition  rely  on  low-level  image-pattern  comparisons  to  compute  simi¬ 
larity  between  two  face  images.  If  the  pose  of  the  head  is  not  roughly  the 
same  in  both  images,  these  types  of  comparison  methods  will  produce  in¬ 
correct  results.  As  the  number  of  subjects  in  the  database  increases,  this 
source  of  error  will  become  more  and  more  important. 

The  computation  of  3D  structure  or  position  information  requires  the  use 
of  multiple  views  of  the  subject.  Since  the  3D  pose  of  the  head  cannot  be 
computed  from  one  image,  it  is  not  possible  even  to  detect  this  source  of 
error  if  only  one  image  of  the  subject  is  available.  Under  this  effort,  two 
uncalibrated  views,  frontal  and  profile,  were  considered.  The  profile  view 
provides  information  about  the  relief  of  the  face  that  cannot  be  computed 
from  the  frontal  view.  This  information  can  be  used  to  better  distinguish 
two  subjects  whose  frontal  views  might  be  incorrectly  compared  because 
of differences in view angle (e.g., tilt of the chin). This scenario is one of the
simplest multiview conditions available, and also describes a real-world
application: matching against traditional mugshots. Hence this problem is
valuable both in the short term and in the long term as a baseline for future
work, in which 3D models will be constructed from more complex
multiview scenarios (e.g., video sequences).

The  TASC  system  processes  both  the  frontal  and  profile  views  in  a  similar 
fashion. Feature extraction is used first to identify two fiducials in the
image that are used to perform geometric normalization, including
adjustments of image plane rotation and scale. Since the images are uncalibrated,
these  normalization  factors  are  specific  to  each  view.  Template  regions  are 
extracted  from  the  normalized  images  and  stored  in  the  database  along 
with  the  location  of  fiducials  from  the  original  images.  A  total  of  five  tem¬ 
plate  regions  are  extracted.  At  the  lowest  level,  two  subjects  are  compared 
on  the  basis  of  general  pattern-matching  techniques  with  only  the  ex¬ 
tracted  normalized  templates.  This  comparison  method  performed  quite 
well  on  the  database  provided,  with  the  largest  source  of  error  being  the 
location  of  the  fiducial  points  used  for  geometric  normalization. 

The  system  can  be  run  in  two  modes.  Comparison  can  be  made  on  the  ba¬ 
sis  of  only  the  frontal  view,  or  on  both  views. 

D-4. USC Approach

The  general  approach  to  face  recognition  used  by  the  University  of  South¬ 
ern  California  (USC)  Computational  Vision  Lab  is  based  on  the  dynamic 
link architecture (DLA) theory of brain function. The program, known as
SCFacerec,  is  an  algorithmic  abstraction  of  DLA  called  elastic  graph 
matching,  which  is  better  suited  for  processing  on  conventional  digital 
computers  than  is  DLA. 

Broadly  speaking,  elastic  graph  matching  finds  a  mapping  between  the 
image  and  model  domains  and  compares  features  sampled  at  correspond¬ 
ing  points  in  the  mapping.  Two  stages  of  elastic  graph  matching  are  used 
by SCFacerec: a spatially coarser stage, in which the face is found and
normalized with respect to scale and position in the image, and a finer stage,
in  which  features  of  the  face  are  located  for  comparison  with  a  gallery  of 
mug  shots.  The  same  basic  graph  matching  scheme  is  used  for  both  coarse 
and  fine  stages;  indeed,  many  of  the  same  functions  are  called  in  both 
steps. 

SCFacerec  may  be  broken  down  into  the  following  components,  each  of 
which  is  described  in  more  detail  below:  (1)  a  fiducial  graph,  (2)  lists  of  fea¬ 
tures  or  "jets,"  (3)  a  similarity  function  for  comparing  jets,  (4)  heuristic 
moves for registering the graph with a facial image, and (5) a priori
knowledge about faces for use in graph matching (also known as general face
knowledge  or  "GFK"). 

The  fiducial  graph  consists  of  a  graph  of  nodes  corresponding  to  anatomi¬ 
cally  identifiable  points  on  the  face.  Choice  of  a  reproducible  set  of  nodes 
for  the  graph  allows  comparison  of  the  same  facial  points  across  different 
poses  and  between  individuals.  Fiducial  graphs  are  also  necessary  for  the 
use  of  differential  weighting  of  graph  nodes  in  recognition  and  to  intro¬ 
duce  jet  transformations  to  account  for  the  effects  of  rotation  in  depth. 

The  system  uses  a  bank  of  multiple-scale  and  multiple-orientation  Gabor 
wavelet  filters  for  feature  extraction.  This  representation  is  based  on  a 
simple  model  of  the  receptive  fields  found  experimentally  in  the  neurons 
of  the  mammalian  primary  visual  cortex.  Use  of  these  features  gives  the 
system  insensitivity  to  changes  in  absolute  illumination  and,  with  the 
similarity  measure  described  below,  to  overall  changes  in  contrast  of  an  image. 
Use  of  the  absolute  power  (i.e.,  modulus  of  the  wavelet  transform)  of  the 
Gabor  features  leads  to  some  insensitivity  to  the  exact  positioning  of  the 
graph  nodes.  The  responses  to  the  eight  orientations  and  five  spatial  fre¬ 
quencies  of  Gabor  wavelet  filters  used  by  SCFacerec  are  coded  as  a  40- 
dimensional  vector  or  jet. 
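
The construction of a jet can be sketched as follows. This is a minimal illustration, not the SCFacerec code: the report states only that eight orientations and five spatial frequencies are used, so the kernel form, the envelope width `SIGMA`, the half-octave frequency spacing, and the function names `gabor_response` and `extract_jet` are all assumptions made for the example.

```python
import cmath
import math

N_ORIENTATIONS = 8      # stated in the text
N_FREQUENCIES = 5       # stated in the text
SIGMA = 2.0 * math.pi   # assumed envelope width; not given in the report


def gabor_response(image, x, y, k, theta):
    """Complex response of one Gabor wavelet centred on pixel (x, y).

    image is a list of rows of grey values, k is the spatial frequency in
    radians per pixel, and theta is the orientation of the wave vector.
    """
    kx, ky = k * math.cos(theta), k * math.sin(theta)
    dc_term = math.exp(-SIGMA ** 2 / 2.0)      # makes the kernel nearly DC-free
    radius = int(math.ceil(2.0 * SIGMA / k))   # truncate the Gaussian envelope
    height, width = len(image), len(image[0])
    total = 0j
    for dy in range(-radius, radius + 1):
        py = y + dy
        if not 0 <= py < height:
            continue  # pixels outside the image are skipped
        for dx in range(-radius, radius + 1):
            px = x + dx
            if not 0 <= px < width:
                continue
            r2 = dx * dx + dy * dy
            envelope = (k * k / SIGMA ** 2) * math.exp(
                -k * k * r2 / (2.0 * SIGMA ** 2))
            wave = cmath.exp(1j * (kx * dx + ky * dy)) - dc_term
            total += image[py][px] * envelope * wave
    return total


def extract_jet(image, x, y):
    """Return the 40 Gabor magnitudes (5 frequencies x 8 orientations)."""
    jet = []
    for nu in range(N_FREQUENCIES):
        k = math.pi * 2.0 ** (-(nu + 2) / 2.0)  # assumed half-octave spacing
        for mu in range(N_ORIENTATIONS):
            theta = mu * math.pi / N_ORIENTATIONS
            jet.append(abs(gabor_response(image, x, y, k, theta)))
    return jet


# Synthetic 24 x 24 test pattern, used only to exercise the functions.
image = [[(3 * col + 5 * row) % 17 for col in range(24)] for row in range(24)]
jet = extract_jet(image, 12, 12)
print(len(jet))  # 40
```

Only the modulus of each response is stored here, which corresponds to the magnitude-only jets used for identity comparison. A phase-sensitive comparison would need the complex responses to be kept as well.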

Jets  are  extracted  and  compared  at  each  node  of  the  graph  both  in  the 
graph-matching  phase  of  the  algorithm  and  in  comparing  faces  in  probe 
and  gallery  image  lists.  The  generalized  direction  cosine  between  two  jets 
is  used  for  the  comparison.  The  normalization  of  jet  length  in  the  calcula¬ 
tion  of  the  direction  cosine  leads  to  an  insensitivity  to  changes  in  the  level 
of  contrast  in  the  image.  In  positioning  graph  nodes  (locations  to  extract 
jets),  a  similarity  measure  is  used  that  also  takes  into  account  the  phase  of 
the  Gabor  transform.  In  comparing  graphs  for  identity  recognition,  only 
the  magnitude  of  the  transform  is  used. 
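
A short sketch of the two comparisons may make the contrast argument concrete. The direction cosine follows directly from the description above. The phase-sensitive variant is only one plausible form, since the report does not give its formula, and the function names are chosen for this example.

```python
import cmath
import math


def jet_similarity(mags_a, mags_b):
    """Direction cosine between two magnitude jets (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(mags_a, mags_b))
    norm_a = math.sqrt(sum(a * a for a in mags_a))
    norm_b = math.sqrt(sum(b * b for b in mags_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero jets
    return dot / (norm_a * norm_b)


def phase_similarity(jet_a, jet_b):
    """Assumed phase-sensitive variant, computed on complex jets.

    Each term is weighted by the cosine of the phase difference, which is
    the real part of a * conj(b) divided by the two magnitudes.
    """
    dot = sum((a * b.conjugate()).real for a, b in zip(jet_a, jet_b))
    norm_a = math.sqrt(sum(abs(a) ** 2 for a in jet_a))
    norm_b = math.sqrt(sum(abs(b) ** 2 for b in jet_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Scaling every coefficient (a contrast change) leaves the cosine unchanged.
jet = [1.0, 2.0, 3.0]
higher_contrast = [2.5 * v for v in jet]
same_direction = jet_similarity(jet, higher_contrast)

# Reversing every phase drives the phase-sensitive measure to its minimum.
complex_jet = [cmath.rect(1.0, 0.3), cmath.rect(2.0, 1.1)]
reversed_jet = [-z for z in complex_jet]
opposite = phase_similarity(complex_jet, reversed_jet)
```

Because both jet lengths are divided out, multiplying all coefficients of one jet by a constant does not change the result, which is the contrast insensitivity described above.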

The  algorithm  samples  the  image  in  a  hierarchical  fashion  to  determine  the 
position  and  scale  of  the  face.  This  is  effectively  a  three-parameter  search 
(two  position  coordinates  and  one  scale  factor). 
Parameter  changes  or  graph  moves  are  accepted  if  the  match  with  the  GFK 
(explained  below)  is  improved.  Finally,  each  node  is  allowed  to  "diffuse" 
or  move  independently  of  the  rest  of  the  graph  to  improve  the  fit  with  the 
individual  probe  face.  Graphs  are  automatically  positioned  on  both  probe 
and  gallery  faces  by  this  method.  Jets  may  then  be  extracted  and  compared 
at  corresponding  points  in  probe  and  gallery  graphs  for  recognition. 
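
The accept-if-improved search and the subsequent node diffusion can be sketched as below. This is an illustration under stated assumptions rather than the SCFacerec procedure: the report does not specify step sizes, move order, or diffusion radius, and the names `fit_graph`, `diffuse_nodes`, `score`, and `node_score` are hypothetical. The scoring functions stand in for the match with the GFK.

```python
def fit_graph(score, start, steps):
    """Greedy coarse-to-fine search over (x, y, scale).

    score maps an (x, y, scale) tuple to a similarity; steps is a list of
    (dx, dy, dscale) increments ordered from coarse to fine. A move is
    kept only if it improves the score.
    """
    best = tuple(start)
    best_score = score(best)
    for step in steps:
        improved = True
        while improved:
            improved = False
            for axis in range(3):
                for sign in (-1, 1):
                    trial = list(best)
                    trial[axis] += sign * step[axis]
                    trial = tuple(trial)
                    trial_score = score(trial)
                    if trial_score > best_score:
                        best, best_score = trial, trial_score
                        improved = True
    return best, best_score


def diffuse_nodes(nodes, node_score, radius=1):
    """Let each node move independently to the best nearby position."""
    moved = []
    for index, (x, y) in enumerate(nodes):
        candidates = [(x + dx, y + dy)
                      for dx in range(-radius, radius + 1)
                      for dy in range(-radius, radius + 1)]
        moved.append(max(candidates, key=lambda p: node_score(index, p)))
    return moved


# Toy score with a single peak at x=12, y=7, scale=1.5 (illustration only).
def toy_score(params):
    x, y, s = params
    return -((x - 12) ** 2 + (y - 7) ** 2 + 10.0 * (s - 1.5) ** 2)


best, best_score = fit_graph(toy_score, (0, 0, 1.0),
                             [(4, 4, 0.5), (1, 1, 0.25)])

# Toy per-node score that prefers fixed target positions.
targets = [(5, 5), (9, 3)]


def toy_node_score(index, point):
    tx, ty = targets[index]
    return -((point[0] - tx) ** 2 + (point[1] - ty) ** 2)


nodes = diffuse_nodes([(4, 6), (10, 3)], toy_node_score)
```

A greedy search of this kind can stop at a local maximum, which is one reason for sampling coarsely first and refining afterwards.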

The  general  face  knowledge  (GFK)  consists  of  a  stack  of  example  faces  on 
which  fiducial  graphs  have  been  positioned  manually.  A  GFK  stack  usu¬ 
ally  contains  between  10  and  70  examples,  depending  on  the  requirements 
of  the  matching  problem.  Once  constructed,  a  GFK  stack  may  be  reused  for 
different  probe  and  gallery  stacks:  reliable  matching  is  fairly  insensitive  to 
the  exact  details  of  the  examples  used  to  construct  the  GFK.  For  each  trial 
position  of  a  graph  node  in  the  matching  process,  the  GFK  is  searched  for 
the  most  similar  jet  at  that  node.  This  information  is  used  to  compute  an 
overall  similarity  of  the  probe  graph  with  the  GFK  stack  and  evaluate 
whether  a  graph  move  improves  or  worsens  the  fit  of  the  graph  to  the 
probe  face. 
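
The use of the GFK stack in scoring a trial graph position can be sketched as follows. The per-node search for the most similar example jet is taken from the description above. Averaging the per-node maxima into one number is an assumption, since the report does not say how the overall similarity is formed, and the names `cosine` and `gfk_similarity` are chosen for this example.

```python
import math


def cosine(jet_a, jet_b):
    """Direction cosine between two magnitude jets."""
    dot = sum(a * b for a, b in zip(jet_a, jet_b))
    norm_a = math.sqrt(sum(a * a for a in jet_a))
    norm_b = math.sqrt(sum(b * b for b in jet_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gfk_similarity(probe_jets, gfk_stack):
    """Similarity of a trial graph to a stack of example graphs.

    probe_jets holds one jet per graph node. gfk_stack is a list of
    examples, each also holding one jet per node in the same node order.
    For every node the best-matching example is found independently.
    """
    total = 0.0
    for node, jet in enumerate(probe_jets):
        total += max(cosine(jet, example[node]) for example in gfk_stack)
    return total / len(probe_jets)  # assumed: plain average over nodes


# Two examples, two nodes each, with tiny 2-dimensional jets for brevity.
stack = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
# Node 0 matches the first example and node 1 matches the second.
mixed_probe = [[2.0, 0.0], [3.0, 3.0]]
mixed_score = gfk_similarity(mixed_probe, stack)
```

Because the best example is chosen separately at each node, a probe face can be matched by a combination of features drawn from different example faces, which may help explain why the result is fairly insensitive to the exact choice of examples.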

The  components  described  above  are  integrated  into  a  system  with  a 
convenient  graphical  user  interface.  The  system  may  be  run  in  batch  mode, 
for  testing  recognition  performance,  or  in  demo  mode,  where  individual 
images  are  processed  for  recognition. 